(Originally posted 2013-08-02.)
In my experience, Coupling Facility Duplexing configuration and performance tend to get neglected once the initial configuration decisions have been made. After all, it’s rare that customers rework their Duplexing design.
Over the past few weeks I’ve been comprehensively reworking my Coupling Facility tabular reporting, as I recently mentioned in Coupling Facility Topology Information – A Continuing Journey.
This post is about the Duplexing part of that. If you agree it’s time to review your Duplexing reporting, read on…
In the previously-mentioned post I talked about signalling rates and overall times at the CF level – for Duplexing. I have those now. While those are interesting they are rather macro level and don’t really talk about outcomes that directly affect applications. (Actually nothing does but I think you’ll agree specific middleware-related structures are more interesting than overall CF numbers when it comes to tuning e.g. Data Sharing applications.)
So let’s talk about structures…
User- Versus System-Managed Duplexing
First, there is only one exploiter of User-Managed Duplexing: DB2 Group Buffer Pools.
Second, the two types are very different from the instrumentation (and other) perspectives: Attempting to treat them the same is a bad idea.
Detecting Primary And Secondary Structures
Formally you need RMF SMF data (Type 74 Subtype 4 records) from the Sysplex Data Gatherer z/OS system. In several of my sets of data I don’t have that. So I have to improvise.
But first the formal bit: For the Sysplex Data Gatherer one Request Data Section is written for each structure. Bits in field R744QFLG in this section denote whether the structure is the old instance (primary) or the new instance (secondary) or neither. “Old” and “new” might seem strange names but duplexing is built on top of structure rebuilding, so the terms are not so strange.
If you don’t have data from the Sysplex Data Gatherer you can sometimes still get the answer:
For User-Managed structures (DB2 Group Buffer Pools) the traffic to the primary is higher than to the secondary. But if there’s no traffic you’re stuck. So my code performs the traffic test and reports accordingly.
Actually when I say “higher” I really mean “much higher”.
There is no such traffic test for System-Managed. In Performance terms the primary and secondary are generally identical: The traffic is much the same. So it doesn’t really matter which is which.
Similarly, for the zero-traffic case Performance isn’t a hot topic anyway. So again detecting which is primary and which secondary isn’t important.
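As a rough illustration of the traffic test for User-Managed structures, here is a minimal Python sketch. It is not the actual reporting code: the function name, the "A"/"B" labels and the ratio threshold are all assumptions made for the example. The post only says the primary’s traffic is “much higher”, so the threshold is a guess you would tune against your own data.

```python
from typing import Optional

# Hypothetical threshold: how many times busier one instance must be
# before we call it the primary. "Much higher" is all the text specifies.
MIN_RATIO = 5.0


def infer_primary(requests_a: int, requests_b: int,
                  min_ratio: float = MIN_RATIO) -> Optional[str]:
    """Guess which of two duplexed structure instances is the primary.

    Returns "A" or "B" for the instance with much higher traffic,
    or None when the traffic test cannot decide.
    """
    # No traffic at all: the test has nothing to work with.
    if requests_a == 0 and requests_b == 0:
        return None

    if requests_a >= requests_b:
        high, low, label = requests_a, requests_b, "A"
    else:
        high, low, label = requests_b, requests_a, "B"

    # One side idle and the other busy, or a large enough ratio.
    if low == 0 or high / low >= min_ratio:
        return label

    # Traffic too similar to call (as you would expect for System-Managed).
    return None
```

Returning None rather than guessing keeps the zero-traffic and similar-traffic cases visibly undecided in a report.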
Duplexing States And Timings
As I mentioned, User- and System-Managed Duplexing are somewhat different.
With User-Managed, the “user” (DB2) coordinates writing to primary and secondary structures. So none of what I’m about to tell you applies to the User-Managed case.
With System-Managed, XES and the two CFs coordinate, and operations in both CFs generally have to complete together.
The coordination for System-Managed manifests itself in the data in a series of fields in the Request Data Section for each version of the structure (whether the system is the Sysplex Data Gatherer or not). These fields generally have a count of events, the total time for the events and the sum of the squares of the times for the events – enough to calculate the average and standard deviation.
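To show why a count, a total and a sum of squares are enough, here is a minimal Python sketch. The function and parameter names are made up for the example and are not RMF field names. It computes the population standard deviation, which is an assumption; a report might use a different convention.

```python
import math
from typing import Optional, Tuple


def mean_and_stddev(count: int, total_time: float,
                    sum_of_squares: float) -> Optional[Tuple[float, float]]:
    """Derive the average and standard deviation from three accumulators.

    count          -- number of events
    total_time     -- sum of the event times
    sum_of_squares -- sum of the squared event times
    Results are in the same time unit as the inputs.
    """
    if count == 0:
        return None  # no events, nothing to average

    mean = total_time / count
    # Variance = E[x^2] - (E[x])^2; clamp at zero to absorb rounding error.
    variance = max(sum_of_squares / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Three events of 10, 20 and 30 time units:
# count = 3, total = 60, sum of squares = 100 + 400 + 900 = 1400
result = mean_and_stddev(3, 60.0, 1400.0)
```

For that example the average is 20 and the standard deviation is a little over 8.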
So these events (and they’re reported in RMF’s Coupling Facility Activity Report) take System-Managed Duplexing down to the structure level. (The numbers are, unsurprisingly, zero for User-Managed.)
One thing they allow you to do is see one aspect of the Service Time cost of Duplexing. I say “one aspect” because, although there are timings in these fields, Duplexing introduces other effects.
For example a non-duplexed LOCK1 lock structure might have service times in the region of 3 to 20 microseconds, depending on link technology. (Here one would expect all the requests to be performed synchronously and these timings reflect that.)
But use System-Managed Duplexing with it and often most of the requests are performed asynchronously, with service times in the tens of microseconds (or hundreds, say, over extended distances).
But at least the structure-level counters and timings can help point out problems.
There is also a role here for the CF-level duplexing statistics: The path-level signal latency times for Duplexing links (if you have them) can help explain why Duplexing performance is what it is. RMF converts them to estimates of distance at 1 kilometer for each 10 microseconds, which is a clue that a lot has to do with distance.
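The latency-to-distance conversion is simple arithmetic, sketched here in Python. The function name is hypothetical, and the sketch does no rounding; whether and how a report rounds the result is not something this example assumes.

```python
# 10 microseconds of signal latency per kilometer, as stated in the text.
MICROSECONDS_PER_KM = 10.0


def estimated_distance_km(signal_latency_us: float) -> float:
    """Estimate link distance in kilometers from signal latency in microseconds."""
    return signal_latency_us / MICROSECONDS_PER_KM


# A path with 250 microseconds of signal latency
# works out at an estimated 25 kilometers.
distance = estimated_distance_km(250.0)
```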
One final word of caution: None of the RMF statistics related to Duplexing – of either flavour – say anything about application impacts from Duplexing. Where there is any evidence at all, it comes from something like DB2 Accounting Trace, where the Asynchronous Database I/O Write Wait time might be extended. But this is scant information at best.
Realistically the best you can do is give the performance-critical structures the best performance you can.
So, as you can tell, I’ve been busy warming over my CF Duplexing code. A few studies more and I’ll probably do it again. 🙂