(Originally posted 2015-11-06.)
This is about the third time I’ve written about this, and it probably won’t be the last. 🙂 
I was presenting to customers about the Coupling Facility Path Latency statistics I’ve previously spoken of when one of them told me of the following incident. I’m sure he won’t mind me sharing it with you, so long as I don’t identify the source.
The customer has two zEC12 machines, each with Internal Coupling Facilities (ICFs) and z/OS LPARs, using InfiniBand links and Internal Coupling links to these ICFs.
The customer believed they had two groups of four InfiniBand paths between one z/OS image and a remote CF. These groups of paths take routes said to be 5km and 8km long.
One day they looked at an RMF Coupling Facility Activity postprocessor report and saw the path data, which is new with OA37826 and CFLEVEL 18. That was a nice surprise.
What wasn’t a nice surprise was the report indicating three paths at 8km and five paths at 5km. This was not what they expected.
Their initial suspicion was that the routing was wrong and the instrumentation right. But it proved otherwise:
- First, by getting an independent measurement of the path lengths, they discovered that all the paths were of the correct length.
- Second, by moving paths between adapters, they isolated the problem to a specific adapter.
So, the upshot was that the adapter card was reporting the incorrect distance. The card has, fairly obviously, been replaced and everything is fine now.
There’s no suggestion there was anything else wrong with the card, but it’s good it was replaced. An interesting question is whether incorrect latency measurements could cause poor routing decisions, but I certainly can’t comment on that publicly.
Another question I can’t answer is whether the latency measurement suddenly went bad. All we know is that when the customer looked at the Coupling Facility Activity report for the first time it had the wrong number in it.
While I don’t propose to write reporting that assumes dynamically changing CF Path Latency values, I do think it’s worthwhile to look occasionally at this data. I always do when I get customer data, and most customers have OA37826 applied and are at CFLEVEL 18 or higher.
So please do look at this every so often, including right now, as a useful verification exercise.
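As a rough illustration of that verification exercise, here is a minimal Python sketch that compares the path distances you believe you have with the ones the report shows. It assumes you have typed the distances in by hand from the Coupling Facility Activity report; it does not parse RMF output or SMF records, and the function name `compare_path_distances` is purely hypothetical. The sample numbers are the ones from this customer's story.

```python
from collections import Counter


def compare_path_distances(expected_km, reported_km):
    """Return {distance: (expected_count, reported_count)} for every
    distance where the number of paths differs between the two lists."""
    expected = Counter(expected_km)
    reported = Counter(reported_km)
    mismatches = {}
    for distance in sorted(set(expected) | set(reported)):
        if expected[distance] != reported[distance]:
            mismatches[distance] = (expected[distance], reported[distance])
    return mismatches


# What the customer believed: four paths at 5km and four at 8km.
expected_km = [5] * 4 + [8] * 4
# What the report showed: five paths at 5km and three at 8km.
# (Hand-entered values, not parsed from a real report.)
reported_km = [5] * 5 + [8] * 3

for distance, (want, got) in compare_path_distances(expected_km, reported_km).items():
    print(f"{distance}km: expected {want} paths, report shows {got}")
```

A mismatch like this doesn't tell you whether the routing or the instrumentation is at fault, only that one of them deserves a closer look.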
I’m now keeping a list of my blog posts on Coupling Facility links in a separate file. Here’s what it looks like so far:
- What’s The Latency Really?
- What’s The Latency, Kenneth?
- The Effect Of CF Structure Distance
- The Missing Link?
- Coupling Facility Topology Information – A Continuing Journey
- System zEC12 CFLEVEL 18 RMF Instrumentation Improvements
I’m learning you can never tell when the well will run dry with technology, and CF Path Latency is certainly a case of this.
This is so common a configuration I’d call it an architectural pattern if I were pretending to be an architect (which I sometimes do). 🙂
Pardon my skepticism on this topic; long-term readers will know it’s justified.
Hopefully nobody is mad at me for mentioning a card went bad. We all know hardware can fail and that’s why we design configurations and procedures to cope with it.