(Originally posted 2011-09-19.)
It’s been a week since the following was posted in IBM-MAIN: Batch Capacity Planning – BWATOOL? So far there’s been no reply. Though a little disappointed, I’m not surprised. "Disappointed" as I was looking for a good debate (even though it wasn’t me who asked the question). "Not surprised" as I think the subject of Batch Capacity Planning is a tough one. The original post prompted me to think about posting on the subject. I think I’ll do it in two parts: this post, on CPU, and a later one on memory.
An obvious place to start is by comparing and contrasting Batch with Online, from the Capacity Planning point of view. This, of course, builds on the Performance / Window perspective:
- Online comprises discrete (maybe discreet) 🙂 pieces of work, seemingly unconnected. Batch work – at least the stuff we tend to care about – is a network of inter-related pieces of work.
- Online pieces of work (transactions) are brief relative to any summarisation interval you’re likely to use. Batch jobs, while many are very short, are often long compared to an interval. Although I’ve seen windows where the jobs run for at most 15 minutes, most Batch has key jobs in it that are much longer than this (the standard RMF interval).
- Online transactions (and their kin) are "scheduled" by being requested. Production Batch tends to be kicked off by a scheduler. (The "and their kin" parenthetic comment refers to the fact that many workloads are transaction-like, such as many styles of DDF requests.)
Those contrasts aren’t exhaustive but they are enough. We’ll use them to inform the rest of this post.
But there is a similarity that’s worth articulating: For both Batch and Online "enough CPU" means "what gets the job done". If Online work fails to meet Service Level Agreements / Expectations / Pious Hopes 🙂 (or whatever), you conclude something has to be done. Similarly, if important Batch fails to meet its business goals there’s pressure to do something. (This post isn’t going to go into the business drivers or the shape of SLAs.)
When I look at the CPU Utilisation for a Batch Window I typically see a huge amount of variability, both within the night and from night to night. This, I surmise, is caused by the "big lumps" characteristic above. And if you try to figure out which job caused a spike it’s hard to do automatically – because of the "interval straddling" characteristic above. But usually it’s fairly obvious – if the number of jobs running is not too large – which job is likely to have caused a spike.
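One way to make the "which job caused the spike?" question a little less manual is to apportion each job’s CPU time across the RMF-style intervals it straddles. The sketch below is minimal and makes an assumption the real world doesn’t honour: it spreads a job’s CPU evenly over its elapsed time, whereas actual consumption is rarely uniform. The job names, times, and CPU figures are invented for illustration; in practice the inputs would come from your job-level accounting records.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)  # the standard RMF interval length

def apportion(jobs, window_start, n_intervals):
    """Spread each job's CPU seconds across the intervals it straddles,
    proportionally to elapsed-time overlap (a simplification: real CPU
    consumption is rarely uniform over a job's life).
    jobs: list of (name, start, end, cpu_seconds)."""
    buckets = [dict() for _ in range(n_intervals)]
    for name, start, end, cpu in jobs:
        run_secs = (end - start).total_seconds()
        for i in range(n_intervals):
            lo = window_start + i * INTERVAL
            hi = lo + INTERVAL
            overlap = (min(end, hi) - max(start, lo)).total_seconds()
            if overlap > 0:
                share = cpu * overlap / run_secs
                buckets[i][name] = buckets[i].get(name, 0.0) + share
    return buckets

# Invented example: one long job straddling four intervals, one short one.
window = datetime(2011, 9, 19, 0, 0)
jobs = [
    ("PAYROLL1", window + timedelta(minutes=10),
                 window + timedelta(minutes=50), 1200.0),
    ("EXTRACT2", window + timedelta(minutes=20),
                 window + timedelta(minutes=25), 60.0),
]
for i, bucket in enumerate(apportion(jobs, window, 4)):
    print(i, {name: round(cpu, 1) for name, cpu in bucket.items()})
```

Even this crude even-spread attribution narrows the candidate list for a spike; a reporting-class scheme (next paragraph) narrows it further.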
It also helps if you have a decent WLM Batch service class (and, hopefully, reporting class) scheme: You can identify which service class caused the spike. And thence the list of candidate jobs could be shorter.
WLM setup helps in another way: Assuming you have a sensible hierarchy of Batch service classes you can establish whether the supposedly-more-important jobs’ service classes are experiencing delays for CPU (from the WLM delay samples in RMF). With this view you can take a "squint at the picture" * approach – smoothing out the spikes to some degree. In fact summarising over a longer interval than you might normally could be useful.
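The "summarising over a longer interval" idea can be sketched very simply: collapse the spiky 15-minute samples into hourly means and the overall shape of the window shows through. The numbers below are invented for illustration.

```python
def resummarise(per_interval, group=4):
    """Collapse per-interval figures into means over longer periods,
    e.g. four 15-minute CPU-busy samples into one hourly figure:
    a programmatic 'squint at the picture'."""
    return [sum(per_interval[i:i + group]) / len(per_interval[i:i + group])
            for i in range(0, len(per_interval), group)]

# Two hours of invented, spiky 15-minute CPU-busy percentages:
busy = [20, 95, 15, 90, 25, 85, 30, 80]
print(resummarise(busy))  # → [55.0, 55.0]
```

The two hourly values tell you the window ran at a sustained 55% busy, which the raw 15-minute spikes rather obscure.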
I think you have to accept that some degree of delay is inevitable at times with spiky work like Batch. Even for the "top dog" Batch service classes. The question is "how much?" If you calculate WLM velocity for these service classes over a long enough interval, and the window’s work only just completes in time, maybe that velocity is a useful metric and threshold: When the velocity drops below that level the window’s work might just fail to get done in time.
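To make the velocity idea concrete, here is a minimal sketch. It uses the basic shape of the WLM execution velocity calculation – using samples as a percentage of using-plus-delay samples – but counts only CPU delay, which is a simplification of the full formula. The sample counts are invented; in practice they would come from the RMF workload activity data referred to above.

```python
def velocity(using_samples, delay_samples):
    """Execution velocity in the WLM style: using samples as a
    percentage of using + delay samples. Here only CPU delay is
    counted, a simplification of the full WLM formula."""
    return 100.0 * using_samples / (using_samples + delay_samples)

# Invented per-interval sample counts for a top Batch service class.
# Summing over the whole window, rather than judging each spiky
# interval, gives the smoothed figure to compare against the level
# at which the window's work only just finished on time.
using = [120, 40, 200, 90]
delayed = [30, 60, 20, 80]
print(round(velocity(sum(using), sum(delayed)), 1))  # → 70.3
```

If a night at 70% window-wide velocity only just made the deadline, a night trending below that is your early warning.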
I appreciate the previous paragraph is a little vague: It’s trying to impart an approach to learning how your Batch works – from the CPU perspective. The "inevitable at times" phrase might be a little controversial: Certainly if your Online day drives the CPU capacity requirement you stand less of a chance of seeing CPU delays of any note in the Batch. But for many installations that’s not true: The Batch drives the CPU requirement (or, in some cases, drives the Rolling 4-Hour Average and hence the software bill).
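For the Rolling 4-Hour Average case, the arithmetic itself is simple: average MSU consumption over each sliding 4-hour window and take the peak. A minimal sketch, with invented hourly MSU figures for a batch-heavy night:

```python
def rolling_4hr_average(msu_per_interval, per_hour=4):
    """Rolling 4-hour average of MSU consumption. msu_per_interval has
    one value per interval; per_hour is intervals per hour (4 for the
    standard 15-minute RMF interval). Returns one average per complete
    4-hour window."""
    window = 4 * per_hour
    return [sum(msu_per_interval[i - window:i]) / window
            for i in range(window, len(msu_per_interval) + 1)]

# Invented hourly MSU figures: the overnight batch peak, not the
# Online day, sets the R4HA here.
msu = [100, 120, 400, 380, 360, 150]
r4ha = rolling_4hr_average(msu, per_hour=1)
print(r4ha, max(r4ha))
```

When the batch peak sets the maximum R4HA like this, smoothing or shifting batch work can reduce the software bill directly, quite apart from any deadline considerations.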
I haven’t used the "job network" characteristic in this post yet. So here are a couple of areas I think it plays in:
- It stands to move the spikes around: When a job gets elongated or delayed things move in the window. That could be the job itself being extended or downstream jobs being delayed.
- Growth is as likely to express itself in delays to jobs’ starting and completion as in increasing CPU requirements. Growth is mainly about "more data", though Development might add function so each input datum "gets more attention". 🙂
These two are related, I think. And they’re both about topology.
A couple of other things:
- Given the complexity of Batch it’s very difficult to predict what the effect of tuning is – on either run times or capacity requirements.
- It’s also very difficult to figure out how the "more engines or faster?" debate plays out. "More and faster" is a clear winner (or at least non-loser). Some parts of the night will favour more. Some others will favour faster. Again the "squint at it" * technique is reasonable: If generally the answer is more, though sometimes it’s faster, probably more wins out. But apply common sense: Optimisation in one place causing de-optimisation in another requires care.
I’ll admit this whole area is a tough one. I’d be interested in what customers do for Batch Capacity Planning – or indeed whether it drives their overall plan.
And soon I’ll write about Memory from the Batch Capacity Planning perspective.
* When I say "squint at it" I mean "use a technique that takes the spiky detail out, leaving an overall (if blurred) picture". I’ve used the term for many years. People don’t look at me oddly when I say it so I assume they know what I mean. 🙂