(Originally posted 2015-11-15.)
I don’t think I’ve written about the concept of Capture Ratio[1] before. To be honest it’s kind of a “nerdy” or “internal” thing. But a recent experience suggests to me it is interesting, even if only for the wrong reason.
What Is Capture Ratio?
Not all CPU in a z/OS system can be attributed to a service class: If you add up all the CPU in SMF 72–3 (Workload Activity) it always amounts to less than the CPU in SMF 70–1 (System-Level).
If we divide Workload CPU by System CPU and turn it into a % we get a Capture Ratio. [2]
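The arithmetic is simple enough to sketch. This is a minimal illustration rather than our actual code: the function name, the service class names and the CPU figures are all made up, and I assume the CPU seconds have already been extracted from the SMF 72–3 and 70–1 records.

```python
def capture_ratio(workload_cpu_seconds, system_cpu_seconds):
    """Return captured (workload) CPU as a percentage of system-level CPU."""
    if system_cpu_seconds <= 0:
        raise ValueError("system CPU must be positive")
    return 100.0 * sum(workload_cpu_seconds) / system_cpu_seconds


# Hypothetical figures: CPU seconds per service class for one interval,
# as might be summed from Workload Activity data.
service_class_cpu = {"ONLINE": 410.0, "BATCH": 305.0, "STC": 185.0}

# Hypothetical system-level CPU seconds for the same interval.
system_cpu = 1000.0

ratio = capture_ratio(service_class_cpu.values(), system_cpu)
print(f"Capture ratio: {ratio:.1f}%")
```

With these invented numbers 900 of the 1000 CPU seconds are attributed to service classes, so the result sits in the middle of the usual range.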
So what do we expect? Our observations are:
- Generally most systems show capture ratios in the range 85% to 95%.
- Capture ratios vary, but not usually by very much. [3]
- Capture ratios are lower for very low utilisation systems than very high utilisation ones.
- Capture ratios are lower for highly paging systems, and probably for high I/O ones.
Generally I don’t see anything better about a system with a capture ratio in the low 90’s than one in the high 80’s, percentagewise. So I wouldn’t fret about that.
How Do We Use Capture Ratio?
As I indicated at the outset, this has been an internal thing.
In a recent study, to tweak nobody’s nose at all, we saw appallingly low and more or less random capture ratios. It turned out we were missing lots of 72–3 records.[4] So the capture ratio was a good diagnostic tool.
Despite what I said about capture ratio being “internal” we have a standard chart that plots capture ratio for a system by day. This is why I know about the behaviours listed above.
What Went Wrong?
In some studies over the past few years our capture ratio has gone over 100%. It really shouldn’t.
While this has been “subliminally troubling” it hasn’t been enough to make me spring into action. With a recent study, however, we were getting capture ratios of hundreds of percent. Enough to set alarm bells ringing. So Dave Betten and I set to debugging.
It’s all down to zIIP: We only get capture ratios above 100% when both the following are true:
- We have substantial zIIP CPU relative to GCP CPU.
- The zIIP Normalisation Factor is substantially higher than 1.
Our code has a combined capture ratio, plus separate ones for GCP and zIIP CPU. We plot the former but have ignored the latter two.
I saw the pattern: Excessive zIIP capture ratio. Dave debugged the logic, which confirmed it. We’re using the zIIP Normalisation Factor[5] wrong in both the general and zIIP capture ratio calculations.
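To show how a normalisation slip can produce impossible numbers, here is a hypothetical sketch. It is not our actual bug, which I haven't detailed here. It assumes one plausible mistake: scaling the workload-side zIIP time by the normalisation factor while leaving the system-side zIIP time unscaled. The stored factor value and the CPU figures are invented. The only thing taken from the real data is that the stored factor has to be divided by 256.

```python
SCALE = 256  # the stored normalisation factor must be divided by 256


def ziip_capture_ratio(workload_ziip, system_ziip):
    """Workload zIIP CPU as a percentage of system zIIP CPU."""
    return 100.0 * workload_ziip / system_ziip


raw_factor = 604             # hypothetical stored value
factor = raw_factor / SCALE  # roughly 2.36

system_ziip_raw = 500.0      # hypothetical system-level zIIP seconds
workload_ziip_raw = 460.0    # hypothetical zIIP seconds summed over service classes

# Consistent units: both sides normalised, so the factor cancels out.
consistent = ziip_capture_ratio(workload_ziip_raw * factor,
                                system_ziip_raw * factor)

# Assumed mistake: only the workload side is normalised.
inconsistent = ziip_capture_ratio(workload_ziip_raw * factor,
                                  system_ziip_raw)

print(f"consistent:   {consistent:.1f}%")
print(f"inconsistent: {inconsistent:.1f}%")
```

In this sketch the inconsistent figure is the consistent one multiplied by the factor, so a larger normalisation factor inflates the result proportionately more.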
After adjusting the zIIP capture ratio in a spreadsheet, one system’s pair of capture ratios looks like this:
I’ve summarised across 8-hour shifts and the x axis is a shift number.
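For what it's worth, this is roughly how I'd sketch the shift summarisation. The function name and the input layout (hour offset, workload CPU seconds, system CPU seconds) are assumptions for illustration, and the sample intervals are invented. The one design choice worth noting is that it sums CPU within a shift before dividing, rather than averaging per-interval ratios, so busy intervals carry their proper weight.

```python
from collections import defaultdict


def capture_ratio_by_shift(intervals, shift_hours=8):
    """Summarise capture ratio per shift.

    intervals: iterable of (hour_offset, workload_cpu, system_cpu) tuples,
    where hour_offset is hours since the start of the data.
    Returns {shift_number: capture ratio in percent}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for hour, workload_cpu, system_cpu in intervals:
        shift = int(hour // shift_hours)
        totals[shift][0] += workload_cpu
        totals[shift][1] += system_cpu
    # Skip shifts with no system CPU to avoid dividing by zero.
    return {
        shift: 100.0 * workload / system
        for shift, (workload, system) in sorted(totals.items())
        if system > 0
    }


# Hypothetical intervals covering three 8-hour shifts.
sample = [
    (0, 90.0, 100.0),
    (4, 180.0, 200.0),
    (8, 40.0, 50.0),
    (12, 45.0, 50.0),
    (16, 270.0, 300.0),
]

by_shift = capture_ratio_by_shift(sample)
for shift, ratio in by_shift.items():
    print(f"shift {shift}: {ratio:.1f}%")
```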
The numbers appear to have “come right” and examining our logic suggests they should be right.
I think I discern that most of the time zIIP capture ratio is slightly above GCP capture ratio. This is what I’d guess, based on zIIPs not doing I/O. But I’m not 100% sure. Future data sets will tell.
Interestingly, the “wrong calculation” zIIP capture ratio was proportionately worse for a system on a machine where the zIIP Normalisation Factor is 6.22 than the ones where it is 2.36. But that’s not surprising.
Putting It Right
One key lesson is: Don’t boost everything by capture ratio to fill in gaps.
- The “low and random” case shows that’s not a good idea, as you introduce distortion that way.
- The “impossibly high” case shows something fundamental is wrong.
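To make the distortion concrete, here is a small sketch of the kind of boost I'm warning against. The `uplift` function and the numbers are hypothetical. It assumes the common practice of dividing a workload's measured CPU by the capture ratio to "fill in" the uncaptured part.

```python
def uplift(cpu_seconds, capture_ratio_percent):
    """Scale measured CPU up by the capture ratio (the practice cautioned against)."""
    if capture_ratio_percent <= 0:
        raise ValueError("capture ratio must be positive")
    return cpu_seconds * 100.0 / capture_ratio_percent


measured = 100.0  # hypothetical CPU seconds for one service class

healthy = uplift(measured, 90.0)   # capture ratio in the usual range
degraded = uplift(measured, 45.0)  # capture ratio depressed by missing records

print(f"uplifted at 90%: {healthy:.1f} s")
print(f"uplifted at 45%: {degraded:.1f} s")
```

The same measured 100 seconds roughly doubles when the capture ratio is artificially low, which is distortion from a data problem rather than anything real about the workload.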
So we know what the “excessively high” case is caused by. Now to get a fix tested and into Production.
And you might expect to see (at least pedagogically) a new chart that separates zIIP Capture Ratio from GCP Capture Ratio. I think this “fine structure” will be useful to glean.
So I hope I’ve shown that Capture Ratio is interesting, even without the bug we’ve troubleshot.
And “every day for us something new, open mind for a different view, and nothing else matters” [6] applies to this. Comme d’habitude. 🙂
[1] I’ve seen people write “capture ration” and it’s not been people for whom English isn’t their first language, but it could be autocorrect. 🙂
[2] Of course this is technically wrong as it’s a percent, not a ratio. Never mind. 🙂
[3] This stability is fairly reassuring. It seems like a real thing.
[4] This has now been resolved and we have a complete set of 72–3 data.
[5] You have to divide what’s in SMF 70–1 and in SMF 72–3 by 256 – which implies a granularity all of its own.