(Originally posted 2009-11-09.)
There’s a pattern I’ve seen in a number of test Parallel Sysplex environments over the past few years, a couple of them this year:
It’s not much use drawing performance inferences from test environments if they’re not set up properly for performance tests.
Sounds obvious, doesn’t it?
There are two problem areas I want to draw your attention to:
- Shared Coupling Facility Images
If you run a performance test in an environment with shared coupling facility images, the CF’s engines are only intermittently dispatched, so you stand to get horrendous request response times and the vast majority of requests going async (given the chance). I’ve even seen environments where XCF refuses to use coupling facility structures and routes ALL the traffic over CTCs. (And I’ve seen a couple of environments where there are no CTCs to route it over, and XCF traffic is then reduced to a crawl.)
- "Short Engine" z/OS Coupled Images
In a recent customer situation I saw the effect of this: The customer was testing DB2 loads (actually a stream of SQL INSERTs). They were also duplexing the LOCK1 structure for the data sharing group. The Coupling Facility setup was perfect, but response times still became really bad once duplexing was established for the LOCK1 structure. Two salient facts: because of duplexing, all the LOCK1 requests were async; and XCF list structure request response times were always awful.
The answer to why this problem occurred lies in understanding how async requests are handled: The coupled z/OS CPU doesn’t spin in the async case. In the "low LPAR weight relative to logical engines online" case, the z/OS LPAR’s logical engines were but rarely dispatched on physical engines. This meant there was a substantial delay in z/OS detecting the completion of an async request. Hence the elongated async response times. As I said, the LOCK1 structure went async once it was duplexed.
As it happens the physical machine wasn’t all that busy: Allowing the LPAR to exceed share – using a soaker job – ensured logical engines remained dispatched on physical engines longer. And, perhaps paradoxically, the async request response times went right down. This, I hope, reassured the customer that in Production (with "longer-engine" coupled z/OS LPARs) async coupling facility response times ought to be OK.
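The effect can be sketched with a toy model (my own simplification, not z/OS internals, and the microsecond figures are made up purely for illustration): the CF finishes an async request quickly, but z/OS only notices the completion the next time one of the LPAR’s logical engines gets dispatched on a physical engine.

```python
import math

def observed_async_response_us(cf_service_us, dispatch_interval_us):
    """Toy model: the CF completes the request after cf_service_us, but
    z/OS only detects the completion at the next point a logical engine
    is dispatched -- modelled crudely here as every dispatch_interval_us."""
    return math.ceil(cf_service_us / dispatch_interval_us) * dispatch_interval_us

# "Short engine" LPAR: logical engines rarely dispatched on physicals,
# so detection delay dominates the observed response time.
print(observed_async_response_us(20, 5000))  # detection wait swamps the CF time

# A soaker job keeps the logical engines dispatched far more often,
# so the observed time collapses towards the real CF service time.
print(observed_async_response_us(20, 50))
```

Crude as it is, it shows why driving the LPAR harder made the async response times go down rather than up.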
Now, this is just Test, but it could unnecessarily freak people out. Hopefully, though, it’s easy to see why Test Parallel Sysplex environments might perform much worse than Production ones.
(I’m guessing you’re going "duh, I knew Test would be worse than Prod". 🙂 But these two cases are specifics of why Test might be even worse compared to Prod than expected.)
Anyhow, I thought they were interesting. I have seen the first (shared coupling facility images) quite a few times now; the second not so much, in fact only once so far.