(Originally posted 2017-03-13.)
If confronted by a plethora[1] of things to manage, you have to be careful with the approach you take.
And so it is with Coupling Facility structures.
Usually I would look at the biggest structures – whether memory, request rate, or CPU is the metric of “bigness”. And normally I’m expecting a few dozen structures in a sysplex.
Recently I was confronted with a scale challenge: Over 800 structures in two coupling facilities[2].
Does CFCC Scale To Hundreds Of Structures?
400 structures or so in a coupling facility raises in my mind the obvious question: “Will Coupling Facility Control Code (CFCC) scale well with such a large number of structures?”
Talking to Development, I’m assured it will. Even with such large numbers the usual questions, such as CF CPU Busy, arise. But nothing new.
How Do You Analyse Lots Of Structures?
This is the meat of the post.
Basically it’s a case of “think of a metric and sort all the structures by that metric, descending”.
So here are some Lock Structure examples:
- Sort by False Contention rate. This is really the subject of a longer post[3] but essentially False Contentions cause extra XCF traffic and hence CPU. This is usually easy to solve: Increase the structure size.
- Sort by XES Contention rate. This time we’re looking to reduce the locking traffic and, if possible, genuine lock collisions. Easier said than done.
And here are some Cache Structure examples:
- Sort by Directory Entry Reclaims.
- Sort by Cross Invalidations.
- Sort by Castouts.
- Sort by Data Element Reclaims.
So this is the same old “top list” approach, but with metrics relevant to CF structures.
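The “top list” approach is easy to automate. Here’s a minimal sketch in Python, assuming the per-structure metrics have already been parsed (say, from RMF’s SMF 74-4 CF Activity data – the parsing isn’t shown). All structure names and numbers below are invented for illustration.

```python
# A minimal sketch of the "top list" approach: sort all structures by one
# metric, descending, and keep the top n. Names and values are invented.

structures = [
    {"name": "DSNDBA0_GBP1",  "type": "CACHE", "dir_entry_reclaims": 45000},
    {"name": "DSNDBB0_GBP10", "type": "CACHE", "dir_entry_reclaims": 52000},
    {"name": "DSNDBA0_LOCK1", "type": "LOCK",  "false_contention_rate": 120.0},
    {"name": "DSNDBC0_GBP2",  "type": "CACHE", "dir_entry_reclaims": 300},
]

def top_list(structs, metric, n=10):
    """Keep only structures reporting this metric; sort descending; top n."""
    having = [s for s in structs if metric in s]
    return sorted(having, key=lambda s: s[metric], reverse=True)[:n]

for s in top_list(structures, "dir_entry_reclaims"):
    print(f'{s["name"]:16} {s["dir_entry_reclaims"]}')
```

Filtering on “structures that report this metric” is what keeps the Lock and Cache metrics naturally separate: a Lock structure simply never appears in a Directory Entry Reclaims top list.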
You’ll also notice that I’ve listed metrics for Lock and Cache structures separately. This is very much in the spirit of Restructuring.
How Did We Get To So Many Structures?
This question is quite important: If you know how you got to so many structures it might give some insight into how to manage them.
In this case – and it’s clear from the structures’ names and types – there are dozens of DB2 Datasharing Groups. A Datasharing Group has a LOCK1 lock structure[4], and several Group Buffer Pool (GBP) cache structures. Their names have the Datasharing Group name embedded in them.
It turns out that the “top Data Element Reclaims structures” list is overwhelmingly dominated by two group buffer pool numbers – GBPs 1 and 10. Each appears across a wide range of Datasharing Groups[5]. In any case this is a nice pattern to spot.
So I suspect cloning of Datasharing Groups. And this suggests consistent undersizing of these two Group Buffer Pools across them.
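Spotting that kind of pattern can itself be automated: strip off the Datasharing Group name and aggregate the metric by buffer pool number. A hypothetical sketch (invented names and counts), assuming the structures follow the usual groupname_GBPn naming convention:

```python
import re
from collections import defaultdict

# Hypothetical per-structure reclaim counts; real numbers would come from
# your RMF data. DB2 GBP structures are named <group>_GBPn by convention.
reclaims = [
    ("DSNDBA0_GBP1", 45000), ("DSNDBB0_GBP1", 41000),
    ("DSNDBA0_GBP10", 52000), ("DSNDBB0_GBP10", 48000),
    ("DSNDBA0_GBP2", 300),
]

# Aggregate by GBP number across all Datasharing Groups
by_pool = defaultdict(int)
for name, count in reclaims:
    m = re.search(r"_GBP(\d+)$", name)
    if m:
        by_pool[int(m.group(1))] += count

for pool, total in sorted(by_pool.items(), key=lambda kv: kv[1], reverse=True):
    print(f"GBP{pool}: {total}")
```

If two pool numbers dominate the aggregated list across many groups, cloning is a fair suspicion.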
So, the management point I alluded to earlier is “wouldn’t it be nice if the customer had some sort of tool that propagates GBP changes across the estate?”
I don’t (yet) know if this customer has such a tool. But it would be really handy if it did, particularly if it could be persuaded to propagate a doubling of the GBPs’ sizes.
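As a sketch of what such a tool might do: given the current sizes of the cloned GBP structures, it could emit updated CFRM policy STRUCTURE statements with doubled INITSIZE and SIZE. Everything below – names, sizes, and the decision to simply double – is hypothetical; in reality the current values would come from the active CFRM policy, and the change would go through proper GBP sizing analysis first.

```python
# Hypothetical current (INITSIZE, SIZE) values in KB for the cloned GBPs.
current = {
    "DSNDBA0_GBP1":  (102400, 153600),
    "DSNDBB0_GBP1":  (102400, 153600),
    "DSNDBA0_GBP10": (204800, 307200),
}

def doubled_statements(sizes):
    """Emit CFRM policy STRUCTURE statements with both sizes doubled."""
    return [
        f"STRUCTURE NAME({name}) INITSIZE({init * 2}) SIZE({size * 2})"
        for name, (init, size) in sorted(sizes.items())
    ]

for stmt in doubled_statements(current):
    print(stmt)
```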
Hand-tuning 800+ structures seems like a non-starter. If that is their reality it’s difficult to get it right. In any case I’m in awe of this customer.
But “one size fits all” is problematic, too.
While the “top list” approach to Performance is not new, it’s the first time I’ve applied it to Coupling Facility structures. And this was caused by the sheer scale.
But I think this approach is useful for even much smaller numbers of structures than 800+.
At this point I’ve written no new code; I’d like to get to that some day. Oh well…
[2] One clue that this is a huge installation: our standard summary report – without any graphs – turned out to be 28MB of HTML. ↩
[3] The lock structure with the highest level of False Contention turns out not to be a DB2 (actually IRLM) lock structure. ↩
[5] The customer said that one of these pools was for indexes: a further hint at a “cookie cutter” approach. ↩