(Originally posted 2017-06-15.)
In Some WLM Questions I outlined my approach to looking at WLM implementations. It was necessarily very high level, but the intention was twofold:
- To prime customers about the kinds of questions I might be discussing with them – if I ever saw their data.
- To give anyone maintaining a WLM policy some structure. It remains my view that WLM needs care and feeding, on a not-infrequent basis.
You could argue these two purposes are essentially what this blog is all about.
So, this post does the same thing but for Parallel Sysplex. Actually it’s Part 1 of 2, dealing with Coupling Facility (CF) questions. The other part (covering XCF) will be along presently.
Again, expect a high level treatment. There are plenty of posts in this blog that talk at a more detailed level.
(Perhaps Superfluous) Disclaimer: This isn’t all about performance and capacity, because I’m not either.
I’ll structure this post in two pieces:
- Resources
- Structures
That’s how I look at Coupling Facility, so it seems as good a structure for this post as any.
Note: Everything I’m talking about is instrumented with SMF Type 74 Subtype 4.
If we were examining z/OS systems we’d start by looking at resources, so it’s natural to look at coupling facilities the same way.
The difference, though, is in what those resources are and how they behave. For example:
- Coupling facilities don’t do I/O in the conventional sense.
- Coupling facilities don’t page.
- Memory management is more or less static.
- Access to resources is not policy-driven; There is no WLM or SRM for coupling facilities.
So let’s examine the different types of resources.
In this piece I assume the coupling facility has dedicated processors.
A basic metric is CPU utilisation. We talk a lot about how busy a coupling facility should be, both for steady state and for recovery situations. As a rough guideline, a CF that tops 40% is one where I would be concerned about the effects of growth. One above 50% I’d be more immediately concerned about. Here I’m touching on the topic of “white space”.
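The 40% and 50% guideline is easy to turn into a quick check. What follows is a minimal Python sketch, not anything from a real tool: the function name, the labels and the sample CF names are hypothetical, and it assumes a CF-level CPU busy percentage has already been derived for each interval from the SMF 74-4 data.

```python
def classify_cf_busy(busy_pct, growth_pct=40.0, concern_pct=50.0):
    """Classify a CF CPU busy percentage against the rough guideline.

    busy_pct is assumed to be already computed for one interval.
    The thresholds default to the 40% / 50% rules of thumb in the text;
    the returned labels are illustrative only.
    """
    if busy_pct > concern_pct:
        return "immediate concern"
    if busy_pct > growth_pct:
        return "growth concern"
    return "ok"


if __name__ == "__main__":
    # Hypothetical CF names and busy values, purely for illustration.
    for cf, busy in [("CF1", 22.5), ("CF2", 43.0), ("CF3", 57.1)]:
        print(cf, busy, classify_cf_busy(busy))
```

The thresholds are parameters because they are rough guidelines rather than hard limits; an installation with different recovery assumptions might well pick other values.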
Usually a sysplex has more than one coupling facility. While I wouldn’t be fetishistic about it, I would investigate the reasons for any significant imbalance.
Which brings us onto a point that strays into the second part of this post: We can readily see which CF structures drive CPU utilisation. So we know which structures might contribute to imbalance. We’ll come back to CF structure-level CPU in a bit.
Memory usage is much more static than with z/OS; You allocate structures and rarely change their size. But this doesn’t make CF memory a boring topic.
As with CPU, the memory instrumentation is good; You can, for instance, readily see how much is installed and how much is free. Again, the concept of “white space” exists for memory. Here, we’re more interested in recovering structures from a failing CF into a surviving one.
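The memory “white space” question can be sketched as simple arithmetic: would the structures in a failing CF fit into the free memory of the surviving one? The Python below is a hedged illustration only. The function name and inputs are hypothetical, sizes are assumed to be in megabytes, and it deliberately ignores duplexing, where a copy of the structure already lives in the other CF.

```python
def can_absorb(surviving_free_mb, failing_structure_sizes_mb):
    """Check whether a surviving CF could hold a failing CF's structures.

    Returns a tuple (fits, shortfall_mb). Duplexed structures are not
    treated specially here, which makes the answer pessimistic for them.
    """
    needed = sum(failing_structure_sizes_mb)
    shortfall = max(0, needed - surviving_free_mb)
    return shortfall == 0, shortfall


if __name__ == "__main__":
    # Hypothetical sizes in MB.
    print(can_absorb(4096, [1024, 2048]))  # enough white space
    print(can_absorb(2048, [1024, 2048]))  # 1024 MB short
```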
But most of my discussions with customers about CF memory haven’t been about leaving space. I’m finding quite a few who have tons of free memory; The point has been to encourage them to exploit the memory. The structures discussion below touches on this also.
Talking of structures, my code calculates how much extra memory would be taken (and how much less would be free) if all structures went to their maximum size. Usually there’s plenty free, even if they did.
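For anyone wanting to do the same sum themselves, here is a minimal Python sketch of the arithmetic. It is not the code mentioned above: the function name and the dictionary keys (`name`, `size_mb`, `max_mb`) are hypothetical, and it assumes each structure's current and maximum sizes have already been extracted, in megabytes.

```python
def growth_to_max(structures, free_mb):
    """Work out the effect of every structure growing to its maximum size.

    structures: list of dicts with hypothetical keys 'name', 'size_mb'
    and 'max_mb'. Returns (extra_mb, free_after_mb); a negative
    free_after_mb means the CF could not accommodate the growth.
    """
    extra = sum(max(0, s["max_mb"] - s["size_mb"]) for s in structures)
    return extra, free_mb - extra


if __name__ == "__main__":
    # Hypothetical structures and sizes, purely for illustration.
    sample = [
        {"name": "LOCK1", "size_mb": 512, "max_mb": 1024},
        {"name": "GBP0", "size_mb": 2048, "max_mb": 2048},
    ]
    print(growth_to_max(sample, free_mb=8192))
```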
Links And Paths
In my experience link and path utilisation are rarely a problem, but there’s plenty of CF-level instrumentation for the cases where this is a problem. My guess is customers generally get this right. In any case the remedies would usually be simple.
I’ve written extensively about CF path statistics. These are now excellent to the point where there’s only one more thing I’d like to see: The number of times a path is chosen.
In the category of “infrastructural understanding” would, of course, be the path latency – a proxy for distance.
Structures are where it gets really interesting, because this is where the applications and middleware come to life. Generally it’s very easy to discern what a structure is for. Indeed my code discerns things like DB2 Data Sharing groups and CICS structures.
Here is an example of a DB2 Data Sharing group, using two CFs. The numbers are the request rates. The obfuscated text is the two CFs’ machine names.
You can, for example, see that the Group Buffer Pool (GBP) structures are duplexed but the LOCK1 structure is not.
There are a number of themes I like to explore:
Structure performance with increasing request rate
A structure whose response time stays stable with increasing traffic is a good thing; One that deteriorates needs investigating.
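One hedged way to put a number on “stays stable” is to fit a straight line through response time against request rate, interval by interval, and look at the slope. The Python sketch below does an ordinary least-squares fit by hand. The function name and the sample figures are hypothetical, and there is no agreed threshold for a “bad” slope; a clearly positive one is simply a prompt to investigate.

```python
def response_time_slope(samples):
    """Least-squares slope of response time against request rate.

    samples: list of (requests_per_second, response_time_us) pairs,
    one per interval. The result is in microseconds per (request/second).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two intervals")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("request rate does not vary across intervals")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical interval data: (requests/second, microseconds).
    stable = [(1000, 5.0), (2000, 5.0), (3000, 5.0)]
    degrading = [(1000, 5.0), (2000, 7.0), (3000, 9.0)]
    print(response_time_slope(stable))
    print(response_time_slope(degrading))
```

A straight line is a crude model; a scatter plot of the same pairs will usually tell you more, but the slope is a handy way to rank many structures quickly.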
CPU usage by structure
This is useful for both capacity planning and understanding the structure’s performance. As an example of the latter, it’s not uncommon for a lock structure on a “local” (IC link connected) CF to have almost all of its response time accounted for by CF CPU – especially at higher request rates.
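To illustrate the lock structure point, the share of response time accounted for by CF CPU can be estimated by dividing structure-level CPU time per request by the average response time. This Python sketch is illustrative only: the function name and parameter names are hypothetical, and it assumes the structure's CF CPU seconds, request count and average response time for an interval are already to hand.

```python
def cf_cpu_share_of_response(cpu_seconds, requests, avg_response_us):
    """Fraction of average response time accounted for by CF CPU.

    cpu_seconds: CF CPU time attributed to the structure in the interval.
    requests: number of requests to the structure in the interval.
    avg_response_us: average response time in microseconds.
    Returns None when there is nothing to divide by.
    """
    if requests == 0 or avg_response_us == 0:
        return None
    cpu_us_per_request = cpu_seconds * 1_000_000 / requests
    return cpu_us_per_request / avg_response_us


if __name__ == "__main__":
    # Hypothetical figures: 3 CPU seconds, 1M requests, 4 microsecond response.
    share = cf_cpu_share_of_response(3.0, 1_000_000, 4.0)
    print(f"{share:.0%} of response time is CF CPU")
```

A share close to 1 suggests a faster CF processor would help more than anything done to the links.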
Memory exploitation and structure sizing
As I said just now, structure exploitation of memory is a key theme. The two main examples are:
- Increasing lock structure sizes, to avoid false contentions
- Increasing the number of directory entries or data elements for cache structures, to reduce reclaims
There is no information on CF links at the structure level, nor do I think there needs to be.
This has been, necessarily, a high-level view. I wanted to give you an overall structure to work from. There are plenty of other blog posts that go rather deeper.
My interest in coupling facilities is not just performance and capacity; The setup aspects help me get closer to how it is to be a customer with a parallel sysplex (or several).
In the next post I’ll talk about XCF, the other (and original) sysplex component.
Oh, you like surprises, do you? 🙂 ↩
If we were talking about z/OS I’d be talking about resources and applications; This is broadly analogous. ↩
My code to process this data continues to evolve, covering more themes and doing it more succinctly. ↩
Though the method extends reasonably well to shared engines, which are unusual in Production. The data is there. ↩
Duplexing, of course, alters this picture. ↩
But LOCK1 not being duplexed is OK as CFPRODA is an external CF. ↩