(Originally posted 2011-05-24.)
When tuning DB2 batch it’s important to know which SMF 101 Accounting Trace record corresponds to which job step.
A few years ago I wrote code to do this. It works fine for all z/OS DB2 Batch except that originated by IMS. Here’s how it works:
- Find all the Type 30 Step-End records for a given job name.
- Find the Type 101 Accounting Trace records for which the Correlation ID in the 101 record matches the job name. (First 8 bytes of Correlation ID is the job name for most DB2 Batch.)
- You match up the records from steps 1 and 2: The SMF ID must match and the start timestamp (QWACBSC) and stop timestamp (QWACESC) in the 101 slot into the time range the Type 30 is for. This gives you the running of the job and the step. The latter is nice because SMF 101 doesn’t know anything about job steps.
But there’s a small problem with this: QWACBSC and QWACESC use GMT and the Type 30 timings use local time. (In the experiment below the local time is 2 hours ahead of GMT: Mainland Europe on Summer Time.) So you have to adjust QWACBSC using the timezone offset. But SMF 101 doesn’t have the timezone offset in it. Except it does. If you compare the SMF record timestamp to QWACESC you get the timezone offset (as QWACESC is GMT).
All the above has worked fine. But now I have a customer with IMS Batch. And for IMS Batch the Correlation ID isn’t the job name. So what can you do?
It turns out that for long-running IMS/DB2 Batch you can still do the timestamp comparison trick – so long as you don’t attempt to match on Correlation ID. For short-running jobs it’s not likely to work as well – because there will be too many potential matches. Fortunately I care rather less about these cases.
Here’s an example match:
- SMF 30: The job name is AFRF032B and it runs on SYSA for 4904 seconds, consuming 289 seconds of CPU.
- SMF 101: There was an IMS batch job (PSB name "AFRFD2D") which ran on SYSA for 4899 seconds, consuming 288 seconds of CPU.
So far so good but they might be different jobs. Let’s examine their timestamps (adjusted for timezones). First start times:
- SMF 30: 11 APR 06 00:31:12.39
- SMF 101: 11 APR 06 00:31:15.93
So we took 3 seconds to get started. Sounds reasonable.
Now end times:
- SMF 30: 11 APR 06 01:52:56.63
- SMF 101: 11 APR 06 01:52:55.40
So we cut the 101 record about a second before we ended the step. Again reasonable.
The bracketing is nice here and I think we have a match.
So now we can proceed to figure out why the step took as long as it did by using the other numbers in SMF 101 to explain the time:
- The Class 1 time not in Class 2 is 4899 – 4608 seconds = 5 minutes. So not much leverage in working on non-DB2 stuff. (For CPU it’s 288 – 273 = 15 seconds.)
- Of the 4608 seconds 2699 are accounted for – 59%.
- The majority of that is in Write To Log time (1450 seconds) but also Lock/Latch Wait time (651) seconds. So clearly some issues here to sort out.
- There’s very little database I/O time, surprisingly (about 250 seconds).
The unaccounted for time (the other 41%) is actually not surprising to me: It turns out this job runs in a low-velocity WLM service class on a busy system so the usual theory (that unaccounted for time is CPU Queuing) seems reasonable. It’s not definitive as there are lots of other components of unaccounted for time, but it’s a reasonable theory.
The last piece -where the time is going – is meant to give you some idea of why you’d want to find the 101 Accounting Trace records for an IMS/DB2 job step you’re interested in tuning. Of course I haven’t delved down into which DB2 package is the one where the Lock/Latch Wait time is most prevalent – but that can be done from the Class 7 and 8 Accounting traces. Nor have I gone anywhere near the SQL.
But this shows you how you can make a reasonable start – even for IMS/DB2. A nice result – having thought for a number of years it couldn’t be done.