They say beauty is in the eye of the beholder. But I hope you’ll agree this is a pretty interesting graph.
It is, in fact, highly evolved – but that evolution is a story for another time and place. I want to talk about what it’s showing me – in the hope your performance kitbag could find room for it. And I don’t want to show you the starting point which so underwhelmed me. 😀
I’m forever searching for better ways to tell the story to the customer – which is why I evolve my reporting. This one is quite succinct. It neatly combines a few things:
- The effect of load story.
- The distance story.
- A little bit of the LPAR design story.
- The how different coupling facility structure types behave story.
Oh, I didn’t say it was about coupling facility, did I?
I suppose I’d better show you the graph. So here it is:
You can complain about the aesthetics of my graphs. But this is unashamedly REXX driving GDDM Presentation Graphics Facility (PGF). I’m more interested in automatically getting from (SMF) data to pictures that tell the story. (And I emphasised “automatically” because I try to minimise manual picture creation fiddliness. “Picture” because it could be a diagram as much as a graph.)
So let’s move on to what the graph is illustrating.
This is for an XCF (list) structure – where the requests are issued Async and so must stay Async.
- Each data series is from a different system / LPAR in the Sysplex.
- This is the behaviour across a number of days for these systems making requests to a single coupling facility structure.
- Each data point is an RMF interval.
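The shape of the data behind a graph like this is simple to sketch. Here is a toy example in Python (the post's own tooling is REXX driving GDDM PGF; the values below are invented, purely to show the structure – one series per system, one point per RMF interval):

```python
# Toy sketch of how the graph is organised: one series per system,
# one (x, y) point per RMF interval. All numbers below are invented.
from collections import defaultdict

# One row per RMF interval: (system, requests/sec, avg service time in microseconds)
intervals = [
    ("PRDA", 28_000, 20.1), ("PRDA", 30_000, 19.8),
    ("PRDB", 12_000, 50.3), ("PRDB", 13_500, 49.7),
    ("TSTA", 400, 85.0),
]

series = defaultdict(list)
for system, rate, service_us in intervals:
    series[system].append((rate, service_us))  # x = load, y = service time

for name, points in sorted(series.items()):
    print(name, points)
```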
Service Times Might Vary By Load
By “load” I mean “request rate”.
I would be worried if service times increased with request rate. That would indicate a scalability problem. While I can’t predict what would happen if the request rate from a system greatly exceeded the maximum here (about 30,000 a second for PRDA), I am relieved that the service time stays at about 20 microseconds.
Scalability problems could be resolved by, for example, dealing with a path or link issue, or additional coupling facility capacity. Both of these example problem types are diagnosable from RMF SMF 74-4 (which is what this graph is built from).
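For orientation, the two quantities each plotted point needs can be derived from per-interval totals of the kind RMF 74-4 reports. A minimal Python sketch – not the author's REXX, and the field names here are my assumptions, reduced to a request count and a total service time per interval:

```python
# Minimal sketch: derive (request rate, average service time) for one
# RMF interval. Field names are assumptions, not the SMF 74-4 layout.
from dataclasses import dataclass

@dataclass
class Interval:
    system: str            # z/OS system name
    async_requests: int    # async requests completed in the interval
    async_service_us: int  # total async service time, microseconds
    seconds: float         # interval length in seconds

def plot_point(iv: Interval) -> tuple[float, float]:
    """Return (requests per second, average service time in microseconds)."""
    rate = iv.async_requests / iv.seconds
    avg_service = iv.async_service_us / iv.async_requests
    return rate, avg_service

# Invented numbers shaped like the PRDA case in the text:
rate, svc = plot_point(Interval("PRDA", 27_000_000, 540_000_000, 900.0))
print(rate, svc)  # 30000.0 20.0
```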
You’ll notice the service times split into two main groups:
- At around 20μs
- At around 50μs
The former is for systems connected to the coupling facility with 150m links. The latter is for connections of about 1.4km (just under a mile). The difference in signalling latency is about (1.4 – 0.15) * 10 = 12.5μs. (While I might calculate that the difference in service time is around 2.5 round trips I wouldn’t hang anything on that. Interesting, though.)
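That back-of-envelope sum, with the rule of thumb made explicit (a sketch; the ~10μs-per-kilometre constant is the assumption the arithmetic above rests on):

```python
# Rule of thumb used above (an assumption made explicit): roughly
# 10 microseconds of signalling latency per kilometre of fibre.
US_PER_KM = 10.0

def extra_latency_us(far_km: float, near_km: float) -> float:
    """Extra signalling latency of the longer link over the shorter one."""
    return (far_km - near_km) * US_PER_KM

print(extra_latency_us(1.4, 0.15))  # about 12.5 microseconds
```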
It should be noted, and I think I’ve said this many times, that you get signalling latency for each physical link. A diversity in latencies across the links between an LPAR / machine and a coupling facility tends to suggest multiple routes between the two. That would be a good thing from a resilience point of view. I should also note that this is as the infinibird 😀 flies, and not as the crow does. So cables aren’t straight and such measurements represent a (quite coarse) upper bound on the physical distance.
Coupling Technology Matters
(Necessitated by the distance, the technology between the 150m and 1.4km cases is different.)
I’ve taught the code to embed the link technology in the legend entries for each system / series.
You wouldn’t expect CE-LR to perform as well as ICA-SR; well chosen, they are for different distances. Similarly, ICA-SR links are very good but aren’t the same as IC links.
LPAR Design Matters
LPAR design might be “just the way it is” but it certainly has an impact on service times.
Consider the two systems I’ve renamed to TSTA and TSTB. They show fairly low request rates and, I’d argue, more erratic service times.
The cliché has it that “the clue is in the name”. I’ve not falsified things by anonymising the names; they really are test systems. What they’re doing in the same sysplex as Production I don’t know – but I intend to ask some day.
The point, though, is that they have considerably lower weights and less access to CPU.
Let me explain:
When a request completes, the completion needs to be signalled to the requesting z/OS LPAR. This requires a logical processor to be dispatched on a physical one – which might not be timely, particularly for an LPAR with a low weight and poor access to CPU.
What’s good, though, is that the PRD∗ LPARs don’t exhibit the same behaviour; their latency in being dispatched and being notified the request has completed is good.
Different Structures Perform Differently
I’ve seen many installations in my time. So I know enough to say that, for example, a lock structure oughtn’t to behave like the one in the graph. Lock structure requests tend to be much shorter than those to cache, list, or serialised list structures.
What I’m gradually learning is that how structures are used matters. You wouldn’t expect, for instance, a VSAM RLS cache structure to behave and perform the same as a Db2 group buffer pool (GBP) cache structure.
I say “gradually learning” which, no doubt, means I’ll have more to say on this later. Still, the “how they’re used” point is a good one to make.
Another point in this category is that not all requests are the same, even to the same structure. For example, I wouldn’t expect a GBP castout request to have the same service time as a GBP page retrieval. While we might see some information (whether from RMF 74-4 or Db2 Statistics Trace) about this I don’t think the whole story can be told.
This example doesn’t show Internal Coupling (IC) links. It also doesn’t show different coupling facility engine speeds. So it’s not the most general story.
- The former (IC links) does show up in other sets of data I have. For example, a LOCK1 structure at about 4μs for IC links and about 5μs for ICA-SR links.
- Showing different coupling facilities for the same structure name would sort of make sense – but not for this graph. (That would be the duplexing case, of course.)
Let me return to the “how a structure of a given type is used affects its performance” point. I think there’s mileage in this, as well as the other things I’ve shown you in this post. That says to me a brand new Parallel Sysplex Performance Topics presentation is worth writing.
But, I hope you’ll agree, the graph I’ve shown you is a microcosm of how to think about coupling facility structure performance. So I hope you like it and consider how to recreate it for your own installation. (IBMers can “stop me and buy one”.) 😀
By the way, I wrote this post on a plane on my way to SHARE in Atlanta, March 4, 2023. So you could say it was in honour of SHARE. At least a 9.5 hour plane ride gave me the time to think about it enough to write the post. Such time is precious.