(Originally posted 2011-06-29.)
This may be stating the obvious – but I wonder to whom it actually is obvious…
I’ve been doing quite a lot of work with batch job timings and CPU recently. (Everything I’m about to say is equally true of steps.) It’s interesting to think about the effects of faster engines versus more engines (a question I haven’t been asked recently) and whether a customer needs more capacity or just faster engines (a question that has come into play).
There’s nothing very new about this but it is worth thinking about. And in particular what we can glean from SMF 30 Step- and Job-end records…
We know lots of things about a job’s timing and related stuff, most notably:
- When it started and ended, and hence the elapsed time.
- Where it ran.
- When it was read in, and hence any initiator delay.
- How much CPU it used – whether TCB or SRB.
- Whether it used a zIIP or zAAP or was eligible to but didn’t.
- How many disk or tape EXCPs it did (and how many tape mounts).
- Step condition codes.
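To make that list concrete, here's a hypothetical per-job summary of those fields in Python. The field names are mine for illustration only, not actual SMF Type 30 section or field names:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class JobEndRecord:
    """Illustrative stand-ins for what SMF 30 tells us about a job.
    Names are mine, not IBM's."""
    jobname: str
    start: datetime          # start timestamp
    end: datetime            # end timestamp
    tcb_seconds: float       # CPU time under TCB
    srb_seconds: float       # CPU time under SRB
    ziip_seconds: float      # zIIP time actually used
    disk_excps: int          # disk EXCP count
    tape_excps: int          # tape EXCP count
    max_condition_code: int  # worst step condition code

    @property
    def elapsed_seconds(self) -> float:
        return (self.end - self.start).total_seconds()

    @property
    def cpu_seconds(self) -> float:
        return self.tcb_seconds + self.srb_seconds

    @property
    def cpu_fraction(self) -> float:
        # "30% CPU" in the sense used below: CPU time over elapsed time.
        return self.cpu_seconds / self.elapsed_seconds
```

What this deliberately cannot express is any CPU-queue time, which is the point of the next paragraph.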
But one thing we can’t know is whether a job encountered significant CPU queuing. At least not from Type 30. We might be able to infer something from DB2 Accounting Trace (SMF 101) but (to rehearse previous arguments):
- We can’t know if the Unaccounted For time (the most likely bucket) really is queuing time. (Generally it is but there are lots of edge cases where it’s something else.)
- We can encounter difficulties in tying up the 101 with the 30 – particularly for IMS.
- We have to take into account CP Query Parallelism.
- The job might not be DB2 at all.
Given we can’t readily establish how long a job queued for CPU, it becomes difficult to establish whether more capacity would help the job. But there is some hope: "I have no idea" is probably not the best we can do:
- The job will be part of a WLM Service Class and so will – if the Service Class period has a velocity goal – have Delay For CPU samples. This is extremely broad brush but can tell you if a high CPU job is likely to suffer from queuing.
- If the EXCP count is high we can infer that a big chunk of the job’s time is for I/O.
- Variability of run time for similar CPU times and EXCP counts suggests the job sometimes gets held up for something.
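The last bullet — run-time variability as a tell — can be turned into a crude metric. A sketch, assuming you have already filtered to runs of one job whose CPU times and EXCP counts are similar:

```python
from statistics import mean, pstdev

def elapsed_variability(runs):
    """Coefficient of variation of elapsed time across runs of one job
    that did essentially the same work (similar CPU time and EXCPs).
    A high value suggests the job is sometimes held up for something:
    CPU queue, tape mounts, contention... the records won't say which."""
    elapsed = [r["elapsed"] for r in runs]
    return pstdev(elapsed) / mean(elapsed)

# Hypothetical runs: near-identical CPU and EXCPs, one much slower run.
runs = [
    {"elapsed": 600, "cpu": 180, "excps": 50_000},
    {"elapsed": 620, "cpu": 178, "excps": 51_000},
    {"elapsed": 900, "cpu": 181, "excps": 50_500},
]
print(f"{elapsed_variability(runs):.2f}")  # high spread despite similar work
```

There is no magic threshold; the point is only that spread with constant work means delay somewhere.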
So, suppose we kept capacity the same and, say, doubled the engine speed. Consider the case of a job that was 30% CPU, and ignore the n-way queuing effects of going from 2n engines to n. Then we might hazard that the CPU time would halve, and the elapsed time would therefore go down by 15%.
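That arithmetic is just Amdahl’s Law applied to the CPU portion of the job. A minimal sketch, using the example job’s numbers rather than any real measurement:

```python
def elapsed_after_speedup(elapsed, cpu_fraction, engine_speedup):
    """Elapsed time if only the CPU portion speeds up (Amdahl's Law).
    Ignores n-way queuing effects, exactly as in the text."""
    cpu = elapsed * cpu_fraction
    other = elapsed - cpu
    return other + cpu / engine_speedup

# A 30% CPU job on engines twice as fast:
new = elapsed_after_speedup(100.0, 0.30, 2.0)
print(new)  # 85.0 — a 15% elapsed-time reduction
```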
Suppose we eliminated queuing (by and large). We can’t really say what the effect would be – except in our example job’s case it’s some fraction of the 70% that isn’t CPU. With the EXCP count we can "hand wave" a number, but it’s precisely that: a hand wave.
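For what a hand wave looks like in code: the per-I/O service time below is an outright guess on my part, and overlap of I/O with CPU (and with other I/O) is ignored entirely:

```python
def handwave_io_seconds(excp_count, ms_per_io=2.0):
    """Crude I/O time guess: EXCP count times an assumed average
    service time per I/O. The 2 ms default is an assumption, not a
    measurement; real per-I/O times vary widely."""
    return excp_count * ms_per_io / 1000.0

# 100,000 EXCPs at an assumed 2 ms each:
print(handwave_io_seconds(100_000))  # 200.0 seconds, give or take a lot
```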
Some of the above is part of why I don’t like to give elapsed time speed-up estimates, and why I’m not overly keen on answering the "faster versus more engines" question for batch. It’s actually worse than I’ve stated: individual job speed-ups are hard to factor into the overall batch window’s outcome when migrating to a newer processor. (The same problem applies whichever speed-up you apply to individual jobs and steps, of course.)
But the "unknowable CPU queuing" fact plays into how to interpret other facts like (as in our example) the job is "only" 30% CPU. We don’t know whether it would’ve been 100% CPU without queuing or 30%. (Probably somewhere in between.) But we can use EXCP count, as I said, to help us guess.
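The honest thing to do with that uncertainty is to bracket it. A sketch:

```python
def queue_free_elapsed_bounds(elapsed, cpu_seconds):
    """If ALL the non-CPU time were CPU queuing, eliminating queuing
    leaves just the CPU time (the job becomes 100% CPU). If NONE of it
    were, nothing changes. A real job sits somewhere between.
    Returns (best case, worst case) elapsed time."""
    return cpu_seconds, elapsed

# The example job: 1000s elapsed, 300s CPU (30% CPU).
best, worst = queue_free_elapsed_bounds(1000.0, 300.0)
print(best, worst)  # 300.0 1000.0
```

The EXCP-based hand wave earlier is what lets you guess where in that range the job actually sits.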
For what it’s worth I rarely see jobs much above 30% or much below 10% CPU. I’d say it was a bell curve around 20%. If true, this means most of the leverage is either in I/O time or CPU queuing. Though as a contrary data point I hear quite frequently of customers upgrading to faster engines and seeing their overall batch get faster.
Welcome to my world of "it depends". :-)