(Originally posted 2017-03-11.)
Not sugar and spice and all things nice. 🙂
Seriously, I'm interested in how Workload Manager (WLM) goals come to be.
I've talked about WLM quite a bit over the years and one theme has repeated itself a number of times: “Just how did you arrive at that goal?”
As I wrote in Analysing A WLM Policy – Part 2 I see three categories of WLM policies:
- IBM Workload Manager Team based policies.
- Cheryl Watson based policies.
- “Roll Your Own” policies.
Corollary: All WLM policies “degenerate” to Category 3. 🙂
(Something I thought about making a footnote but decided it was too important: If you're not actively maintaining your policy enough to look a fair amount like Category 3 you're probably not maintaining it enough to meet current needs.)
This post isn't really about the structure of a WLM policy, but rather the goal values for each service class period.
Suppose you have a goal like “95% of transactions to complete in 22 milliseconds”. There are questions I'd like to ask about this – both about the “95%” part and the “22ms” part. Here are a couple to start with. More in a minute.
- Is this goal realistic?
- Is this goal necessary?
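Before asking whether a percentile goal is realistic, it helps to be clear about what attaining one actually means. As a purely illustrative sketch (this is just arithmetic on a hypothetical sample of response times, not any WLM-provided interface), checking the “95% within 22ms” goal amounts to counting how many transactions beat the target time:

```python
# Hypothetical illustration: evaluating a percentile response time goal
# ("95% of transactions to complete in 22 milliseconds") against a
# sample of observed response times. Not a WLM API - just counting.

def percentile_goal_met(response_times_ms, goal_ms=22.0, goal_pct=95.0):
    """Return (achieved_pct, met) for the given sample of times."""
    if not response_times_ms:
        return 0.0, False
    within = sum(1 for t in response_times_ms if t <= goal_ms)
    achieved_pct = 100.0 * within / len(response_times_ms)
    return achieved_pct, achieved_pct >= goal_pct

# Example: 19 fast transactions plus one slow outlier is exactly 95%.
sample = [10.0] * 19 + [40.0]
print(percentile_goal_met(sample))  # (95.0, True)
```

Note that the one 40ms outlier doesn't matter at all here – which is precisely the attraction (and the risk) of a percentile goal over an average one.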
Now, this is a (Percentile) Response Time goal. I have questions in a similar vein about Velocity goals.
Response Time Goal Values
Response time goals come from somewhere. Quite often it's a case of “we'll ask for what we're currently achieving”. I guess this mostly answers the first question:
- Is this goal realistic?
It tends to answer it because, presumably, a goal is likely to remain achievable. But not always.
The second question is a little more awkward:
- Is this goal necessary?
It's almost the same as another question:
- Did the business ask for this goal value?
I'd probably be living in a fantasy world if I thought the conversations about performance between IT folk and their customers were as extensive as they ought to be.
Here's another one:
- Would it help to achieve shorter response times?
Better performance is rarely free. So, to reduce that response time from e.g. 22ms to 15ms might well take money. Money for CPU (and hence software) and for memory are two obvious examples. People time to tune (e.g. SQL) is another.
- Is e.g. 95% the right clipping level?
This is a difficult one. It depends on your attitude to outliers – and whether you expect to get many.
And here – as with our sample goal – there are two dimensions: Percentage and target response time.
I recently came across a pair of CICS response time goals. One had a tighter response time but a lower percentage. The other had a looser response time and a higher percentage. It would be very difficult to establish which was harder. And it would be charitable to assume the site was consciously handling differing outlier patterns. My suggestion would be to consider combining these two CICS service classes.
And for an average response time goal you really are allowing for a lot of variability.
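To see how much variability an average goal tolerates, consider two hypothetical workloads (the numbers below are made up for illustration) with identical average response times but very different outlier behaviour:

```python
# Two made-up workloads with the same average response time (22 ms).
# An average response time goal treats them identically; a percentile
# goal would not.

steady = [22.0] * 100                     # every transaction at 22 ms
spiky = [10.0] * 90 + [130.0] * 10        # 90% fast, 10% bad outliers

def average(times_ms):
    return sum(times_ms) / len(times_ms)

print(average(steady), average(spiky))  # 22.0 22.0
```

The spiky workload would badly miss a “95% in 22ms” percentile goal, yet looks identical to the steady one under an average goal.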
- What do we actually expect WLM to do to help?
There are some goals that are utterly unachievable no matter what WLM tries to do. For example, locking issues are rarely1 solved by WLM. So setting an unattainable goal in the face of that is asking for trouble.2 WLM also can't make a processor faster, nor a transaction take substantially fewer cycles.
The gist of response time goals is that they have a tangible relationship to “real world” outcomes. But in modern, complex IT environments z/OS internal response times are somewhat “semi-detached” from what the end user sees.
Velocity Goal Values
Velocity goals are less directly relatable to real world outcomes. It would be rare for a business to demand a velocity of, say, 70% from the IT folks. It would be more usual to request “top priority” though that doesn't necessarily mean “Importance 1”.3
I've already touched on a lot of the questions around velocity goal values – as they are much the same as for response time goals.
But there are some twists.
- Just because a goal value was right before is it still right?
We recommend customers re-evaluate velocity goals in the light of things like processor configuration changes and disk controller replacements. For example, more capacity might lead to less CPU queuing. This would show up in fewer “Delay For CPU” samples. Conversely, if this upgrade was achieved with faster processors there might well be fewer “Using CPU” samples. So the velocity could change in either direction.
So, I recommend people understand velocity goal attainment from two angles:
- The Using and Delay samples – which I hinted at above.
- How the velocity varies with load.
These two are beyond the scope of this post. But both of the above feed into the assessment of what's realistic and how it might change with workload and system configuration changes.
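The relationship between those samples and the achieved velocity is the standard formula: velocity = 100 × Using ÷ (Using + Delay). Here's a minimal sketch of how configuration changes can move it – the sample counts are invented for illustration:

```python
# Sketch of the execution velocity calculation from Using and Delay
# samples: velocity = 100 * Using / (Using + Delay).
# All sample counts below are made up for illustration.

def velocity(using_samples, delay_samples):
    total = using_samples + delay_samples
    if total == 0:
        return 0.0
    return 100.0 * using_samples / total

# Before an upgrade: heavy CPU queuing inflates Delay samples.
print(velocity(using_samples=300, delay_samples=200))  # 60.0

# More capacity: fewer Delay samples, so the achieved velocity rises.
print(velocity(using_samples=300, delay_samples=100))  # 75.0

# But faster processors also mean fewer Using samples, so the
# achieved velocity could move in either direction.
print(velocity(using_samples=150, delay_samples=100))  # 60.0
```

This is why a velocity goal that was right before an upgrade may be too loose – or too tight – afterwards.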
The general drift of this post is that goal values need just as much care as goal structures.
I have a slide I usually put into every WLM section of a workshop. It outlines seven questions I like to ask about a WLM policy, questions installations should ask themselves periodically.
To it I'd like to add a “Bonus Question”: Just where did you get these goal values from anyway?
Having asked that question I think I can make the conversation very interesting indeed. 🙂