Bursty Batch

Bursty batch is quite common. For example, one customer I’m dealing with right now kicks off a burst of batch at 7PM and another burst at 10PM. I doubt that customer is reading this blog post. Another customer has a burst of batch kicking off at 2AM. They probably will read this post. But their operational security is assured: this pattern is quite common. 😃

It’s worthwhile thinking about how this comes to be:

  • In the cases above there are business reasons for releasing batch at specific times – in their case, instructions from external actors.
  • The end of day for a CICS service is another example – which might be a bounce to let batch run and then pick up new files.
  • Some prerequisite operation completes.
  • Some arbitrary definition of when the batch starts.

In any case a lot of work suddenly can run. But should it?

The temptation is to let it all in, possibly motivated by the need to make it run as quickly as possible. But this is not consequence-free: it can lead to thrashing.

CPI As An Indicator Of Thrashing

If you throw too much work in at once you might expect thrashing of CPU elements, such as the cache hierarchy.

This, for one, can lead to a typical instruction taking longer. I hope it’s obvious to you that cache misses cost CPU cycles while the data is fetched. Even cache hits serviced from another drawer can take a few hundred cycles. These are wasted cycles. Now, whether this leads to elongated run times is another matter. Suffice it to say an increase in CPU time for a job makes it more prone to queueing – which can lead to even more cache-related wasted cycles.

Wasted cycles might have a financial impact. With older software licensing schemes, based around the peak rolling four hour average GCP CPU, it’s quite common to see the batch driving the cost. And quite often soft capping is involved – which stands to elongate things further.
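To make the cost mechanics concrete, here’s a minimal sketch of the peak rolling four hour average calculation. The function name and the MSU figures are my own inventions for illustration; real tooling would work from interval-level SMF data.

```python
# Sketch: peak rolling four-hour average from interval-level GCP CPU (MSU)
# readings. The function name and all figures are invented for illustration.

def rolling_4hr_peak(msu_by_interval, intervals_per_hour=4):
    """Return the highest average over any rolling four-hour window."""
    window = 4 * intervals_per_hour  # e.g. 16 x 15-minute intervals
    averages = [
        sum(msu_by_interval[i:i + window]) / window
        for i in range(len(msu_by_interval) - window + 1)
    ]
    return max(averages)

# A day of 15-minute MSU readings: quiet at 100, then a 7PM batch burst at 400.
day = [100] * 76 + [400] * 12 + [100] * 8   # 96 intervals = 24 hours
print(rolling_4hr_peak(day))  # 325.0 – the burst drives the peak
```

Even a three-hour burst dominates the peak four-hour window, which is why the batch so often drives the cost under such schemes.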

SMF 113 includes two useful counters – at the logical processor level: Instructions Executed and Cycles While Executing Instructions. These are in the Basic Counter Set and have been there since z10 (i.e. the beginning). So you certainly can perform the calculation: Cycles Per Instruction (CPI) is Cycles While Executing Instructions divided by Instructions Executed.
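The calculation itself is trivial. A sketch, with invented counter values for one logical processor (the function name is mine, not an SMF field name):

```python
# Sketch: Cycles Per Instruction (CPI) from the two Basic Counter Set
# values described above. The counter values are invented for illustration.

def cpi(cycles_while_executing, instructions_executed):
    """CPI = Cycles While Executing Instructions / Instructions Executed."""
    return cycles_while_executing / instructions_executed

# One logical processor's counters for a 30-minute interval (made up):
print(cpi(cycles_while_executing=9.6e12, instructions_executed=3.2e12))  # 3.0
```

Note 3.0 sits comfortably in the typical range I mention below.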

(Don’t quote me but) I’m seeing CPI typically in the 2 to 4 range. I say “don’t quote me” because it depends on a lot of things, including processor generation but also LPAR design and workload. In all the customers I’ve ever seen there’s been a daily cycle (pardon the pun) that CPI is observed to follow.

By the way, if the LPAR gets busy it might cause unparking of Vertical Low (VL) logical processors – and work running on those will almost certainly exhibit a higher CPI than on Vertical High (VH) and Vertical Medium (VM) logical processors. Bursty work could well do that. Which sometimes explains why I see spikes in CPI, usually at the same time each day.
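A simple way to see this effect is to compute CPI separately by logical processor polarity. A sketch, with invented per-processor counters (the data layout is mine, purely for illustration):

```python
# Sketch: CPI broken out by logical processor polarity (VH / VM / VL).
# The per-processor counter values are invented for illustration.

from collections import defaultdict

# (polarity, cycles while executing, instructions executed) per logical CP
processors = [
    ("VH", 8.0e12, 3.2e12),
    ("VH", 7.8e12, 3.0e12),
    ("VM", 4.0e12, 1.4e12),
    ("VL", 2.4e12, 0.5e12),  # unparked during the burst: note the higher CPI
]

totals = defaultdict(lambda: [0.0, 0.0])
for polarity, cycles, instructions in processors:
    totals[polarity][0] += cycles
    totals[polarity][1] += instructions

for polarity, (cycles, instructions) in totals.items():
    print(polarity, round(cycles / instructions, 2))
```

With figures like these the VL processor stands out clearly – which is the signature to look for when a CPI spike coincides with unparking.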

SMF 113 is typically recorded on the 30-minute SMF interval. You’d think that is far too long to capture bursty batch. But note:

  • Severe burstiness would “move the needle” – even if there were, say, 15 minutes of it. Conversely, you might consider it not severe if there was little trace of it.
  • If you see – in SMF 113 – a spike in CPI you can bet the actual spike was much worse.
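The dilution is easy to see with a little arithmetic. Suppose the real spike lasts the second 15 minutes of a 30-minute interval (all figures invented):

```python
# Sketch: a 15-minute CPI spike diluted across a 30-minute SMF interval.
# All counter values are invented for illustration.

# First 15 minutes: quiet. Second 15 minutes: the burst.
quiet_cycles, quiet_instructions = 1.0e12, 0.5e12   # CPI 2.0
burst_cycles, burst_instructions = 6.0e12, 1.0e12   # CPI 6.0

interval_cpi = (quiet_cycles + burst_cycles) / (quiet_instructions + burst_instructions)
print(round(interval_cpi, 2))  # 4.67 – well below the burst's true CPI of 6.0
```

The interval-level figure understates the burst, so a visible spike at the 30-minute grain implies something worse underneath.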

I wouldn’t recommend you drop the SMF interval, hoping to capture such things better. That’s the sort of thing you leave to SMF 98.

But CPI is not the only indicator. You might see lots of other evidence, such as:

  • CPU Queuing, or zIIP-on-CP. This would be at the service class period level – in SMF 72-3.
  • Locking, buffer pool misses, etc in Db2 Accounting Trace (SMF 101).
  • Unexplained variations in job and step elapsed time.
  • Initiation delays – in SMF 30 and 72-3. We’ll come back to this one.

Of course, this isn’t an exhaustive list.

Is WLM Too Slow?

We don’t want WLM to be in “nervous kitten mode” – namely, overreactive. On the other hand we don’t want it to be underreactive, either.

We want WLM to make the right decisions, with the right data, in a timely fashion.

The last is the sticking point: WLM operates in a matter of seconds, but each change is only going to add a few initiators. This is a “smoothed response” – which is generally better than “nervous kitten”.

So an onrush of submitted batch can lead to initiator delays.

You could dispense with WLM-managed initiators altogether – and hope to get it right manually. Or you could have an excess of initiators and watch your batch thrash.

Fortunately there is (soon going to be) another way. Read on.

z/OS 3.1 WLM AI Initiators

This new Artificial Intelligence (AI) function observes your batch and predicts when the work will spike. Before the spike it will nudge WLM towards adding initiators.

I rather like this function and the word “nudge” is doing the heavy lifting here: The AI adds Initiator Delay samples (R723CTDQ in RMF SMF 72-3). This happens ahead of the predicted spike. But the samples are only one factor in WLM’s decision to add more initiators. System conditions have to be taken into account, such as GCP and (as of z/OS 2.5) zIIP.

This design looks good because it minimises the risk of over-initiation causing thrashing. And it tells WLM something it has no other way of knowing: when work is coming over the horizon. Such as our 7PM, 10PM, and 2AM spikes.

Fairly obviously, I hope, the work has to be broadly predictable. If there’s a sudden burst of work that is “out of phase” you can’t expect the AI to spot that.

WLM Knows Best – Or Does It?

WLM gets its information in a number of categories:

  • Classification rules
  • Goals in the Service Definition / Active Policy
  • Sampled workload attainment
  • System conditions

And now another:

  • AI

The last three are automatic (with AI only being there if you set it up). The first two are worth talking about:

  • You need to make sure the right batch is classified to the right service class. For example, TWS (or OPC to us old folks 😃) can place late-running work on the critical path into a (supposed) Critical Batch service class. But many installations are doing this manually.
  • The batch goals need to be right – both period durations and goal values.

A note on the word “supposed”: TWS will assign such work to a specific service class name. It’s up to you to make sure that really is an appropriate service class. And much of that is to do with the other point: Decent goals.

Parting Shorts

Well, that was a long post. I wanted to get two concepts across:

  • Over-initiation can cause thrashing and SMF 113 and 72-3 can illuminate that.
  • z/OS has a nice new (optional) function that can help with delayed initiation.

Some other parting sho(r)ts:

  • It’s ever more important to get WLM classification and goal setting right.
  • Consider the value and possibility of feeding work in more judiciously. Your batch might even perform better.
  • When thinking about whether z/OS 3.1 WLM AI Initiators will eventually be able to help you, plan for the z/OS AI foundation work needed to enable it – and any other Systems Management capabilities that might come along. It’s not trivial but it’s not perversely difficult either.

The Making Of

This post was written on a flight to Istanbul – and tidied up on the flight back. The purpose of the trip was to present z/OS 3.1 to a bunch of Turkish customers. And I met with a few of them. It’s been too long since I was last here – and we all know why. 😕 So I was very pleased to meet them again – and this very topic came up in each call. I suppose there’s a shiny new thing to talk about so inevitably it will come up. But, one shouldn’t be in a “hammer looking for a nail” situation.

I won’t claim any errors or insults are the result of cramped conditions – of course.

Seriously, a longish flight gives me time to think and write.

Published by Martin Packer

I'm a mainframe performance guy and have been for the past 35 years. But I play with lots of other technologies as well.
