New CPU Information In SMF Type 30 Records

(Originally posted 2012-10-10.)

Round about now you’d be expecting posts to be geared towards the recent zEC12 announcement, or perhaps CICS TS 5.1 or the DB2 11 Preview, or IDAA V3. So what this post is about will probably have slipped by unnoticed. After all you don’t spend all your time looking for obscure New Function APARs, do you? 🙂

But I think some of you will find this one of value, or at least quite interesting.

(I presented a slide on this at the UKCMG 1-day meeting, October 10 2012, so you might consider this to be the script for that slide.)

I could’ve given this post a provocative title like "Do You Really Need CICS PA¹?" and in a minute you’ll see why that’s an only slightly daft question to ask.

Single TCB (task) speed has always been an important topic, and continues to be so. Here are three examples of why, irrespective of processor technology:

For CICS regions much of the work is still performed on the QR (Quasi-Reentrant) TCB. There’s one per CICS region and when it’s saturated the region can support no more throughput: Installations are usually forced to split the affected regions.
For CPU-bound batch jobs there is usually a single TCB conditioning their speed: To make them go faster takes application code tuning, removal of queueing or a faster processor.
For complex CPU-bound DB2 queries the picture is similar to that of CPU-bound batch jobs. But here CPU Query Parallelism might help.

And these are just the most obvious examples, which we all know and love.

There’s an industry trend that’s beginning to make this even more important: Although the zEC12 had a very healthy single-processor speed increase over the z196, this is not a long-term trend. Processors of all architectures are getting faster more slowly, and this probably isn’t going to change. All architectures are relying on more engines and more threads to support larger workloads and zEC12 upping the limit to 101 engines from 80 is a good example of that.

So it behoves us to understand the single-TCB proclivities of our workloads, for all four reasons, and more.

A key point is, of course, what to do about it. But this post introduces some new instrumentation that at least helps with the analysis.

APAR OA39629, available for z/OS Releases 12 and 13, has the title "New Function To Report The Highest Percent Of CPU Time Used By A Single Task In An Address Space".

It provides two new fields in SMF 30 Interval (subtypes 2 and 3), Step-End (subtype 4) and Job-End (subtype 5) records:

CPU % of the highest CPU consuming task.
This task’s program name.

For Step-End and Job-End records the CPU % is the highest percentage among the intervals during the running of the job or step.

The rest of this post is slightly speculative, as I’ll confess I haven’t actually seen SMF 30 records with the new fields in yet.

When there is no CPU you get blanks for the program name. If the program can’t be determined you get ‘????????’.
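As a rough sketch of how a reporting program might handle these cases, here is some Python. It assumes the two new values have already been pulled out of each record into plain Python values; the `TopTask` layout, the function names and the labels are all made up for illustration and are not the real SMF 30 mapping.

```python
from typing import NamedTuple, Sequence

NO_CPU = "(no CPU)"              # hypothetical label for a blank program name
UNDETERMINED = "(undetermined)"  # hypothetical label for '????????'


class TopTask(NamedTuple):
    """The two new values from one interval record (hypothetical layout)."""
    cpu_pct: float  # CPU % of the highest CPU consuming task
    program: str    # that task's program name, as recorded


def program_label(raw: str) -> str:
    """Turn the recorded program name into something safe to report on."""
    name = raw.strip()
    if not name:
        return NO_CPU            # blanks: there was no CPU
    if name == "????????":
        return UNDETERMINED      # the program could not be determined
    return name


def step_end_pct(intervals: Sequence[TopTask]) -> float:
    """Highest percentage among the intervals, as the step/job-end value is described."""
    return max((task.cpu_pct for task in intervals), default=0.0)
```

For example, `program_label("DFHSIP  ")` gives `"DFHSIP"`, while a field of eight blanks gives the "no CPU" label.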

Let’s return to CICS. Consider the following diagram:

This depicts a CICS region, though not a wholly typical one. As depicted, I would expect the QR TCB to be the biggest in most CICS regions. I don’t know whether the program name will actually be "DFHSIP" but I would expect it to be mnemonic and it’ll probably start with "DFH". If this is right we have a ready way in Type 30 to figure out how big the QR TCB is and therefore whether it is an impending constraint. And we can do this without creating CICS Statistics Trace records.² I mentioned Type 30 records and QR TCB in "He Picks On CICS" without a solution to the question of how to distinguish QR TCB from the rest.

The diagram also shows a File-Control TCB (think “VSAM”), three MQ TCBs and four DB2 TCBs. A typical region wouldn’t have all these doing much, if indeed they were present. And showing this level of evenness would, I’d hazard, be unusual.

For CICS regions with a heavy DB2 component, for example, the QR TCB might not be the biggest TCB³. In this case we’ll see a different program name and we can provide an upper bound on the QR TCB %. We’d do this by subtracting the biggest TCB (whatever that is) from the headline TCB time – also in the Type 30 (with some adjustments to make the maths right)⁴.
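Here is a minimal sketch of that subtraction in Python. It assumes the "adjustments" amount to no more than converting the headline TCB time to a percentage of one engine over the interval, which may well be too simple. The function and parameter names are hypothetical. It also uses one extra piece of logic not spelled out above: if the QR TCB isn’t the biggest task it can’t be bigger than the biggest task, so the bound is the smaller of the two numbers.

```python
def qr_tcb_upper_bound_pct(headline_tcb_seconds: float,
                           interval_seconds: float,
                           biggest_task_pct: float) -> float:
    """Upper bound on QR TCB %, for an interval where QR is NOT the biggest task."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    # Assumed adjustment: express headline TCB time as a % of one engine.
    headline_pct = 100.0 * headline_tcb_seconds / interval_seconds

    # Everything that is not the biggest task, floored at zero.
    remainder_pct = max(headline_pct - biggest_task_pct, 0.0)

    # QR can exceed neither the remainder nor the biggest task itself.
    return min(remainder_pct, biggest_task_pct)
```

With made-up numbers: 270 seconds of TCB time in a 900-second interval is 30% of an engine. If the biggest task used 25%, the QR TCB can have used no more than 5%.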

Of course CICS PA and the standard DFHSTUP (CICS Statistics Utility Program), which prints CICS TCB percentages at the subsystem level, do far more than just reporting CPU at the transaction instance and region level. But sometimes all you need is to figure out if the QR TCB is a vulnerability. In fact if you do have a region that needs work you probably would drill down (using something like CICS PA).
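To illustrate that kind of quick check, here is a hedged Python sketch that sorts regions into "fix now", "monitor growth" and "don’t worry" using only the biggest-task CPU % from an interval. The thresholds (70% and 40%), the region names and the numbers are invented for the example; you would pick your own. It also leans on my expectation that the biggest task in a CICS region is usually the QR TCB, which real data might not bear out.

```python
def triage_region(top_task_pct: float,
                  fix_now_pct: float = 70.0,
                  monitor_pct: float = 40.0) -> str:
    """Classify a region by its biggest task's CPU % (thresholds are hypothetical)."""
    if top_task_pct >= fix_now_pct:
        return "fix now"
    if top_task_pct >= monitor_pct:
        return "monitor growth"
    return "don't worry"


# Made-up peak-interval values for three imaginary regions.
regions = {"CICSA": 82.0, "CICSB": 45.5, "CICSC": 12.0}
triaged = {name: triage_region(pct) for name, pct in regions.items()}
```

Because the QR TCB can’t be bigger than the biggest task, a region that lands in "don’t worry" on this measure stays there whichever TCB turns out to be the biggest.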

The above would use the Interval records (subtypes 2 and 3) and the same approach could be used with any long-running address space⁵. But there’s value for batch jobs – using subtypes 4 and 5. Admittedly most jobs are single-tasking. But not all are: For example, DB2 Utilities can be significant multitaskers⁶. So there is some value in finding the biggest TCB (and subtracting from the "headline" TCB number): You can better assess the benefit of faster engines (or understand which job steps are susceptible to engines not getting much faster).

So, I’m really looking forward to seeing real customer data – which I’m convinced will be very interesting.

I can’t believe it’ll be long before I see some. And when I do I’ll write some more about it.

Notes:

¹ CICS Performance Analyzer, which produces reports based on SMF 110 Monitor Trace records, i.e. at the Transaction ID or transaction instance level.
² I think this is actually quite significant: While I see customers with only a few CICS regions I see others with tens, hundreds or (in a few cases) thousands of CICS regions. Turning on Statistics Trace for a large subset of regions, particularly with a sensible Statistics Interval, is cumbersome if you’re just monitoring QR TCB %. Triaging them into "fix now", "monitor growth" and "don’t worry" regions is something you’d like to do with SMF 30 rather than SMF 110. After all most customers have SMF 30 Interval records permanently switched on.
³ I think this is a little unlikely: A CICS region will typically have multiple DB2 (or MQ) Attach TCBs, each of which would normally be quite small. So I’d still expect the QR TCB to be the biggest.
⁴ Another approach might be to use the fine structure of the Usage information I mentioned in "Another Usage Of Usage Information". But the boundaries of usage might not completely align with TCBs. This is an experiment yet to be performed.
⁵ But many address spaces are architected so as not to have a "dominant" TCB, and indeed CICS has moved that way.
⁶ You could use SMF 101 Accounting Trace to see some elements of multi-tasking but I’d hope you wouldn’t have to.

Published by Martin Packer
