(Originally posted 2007-01-26.)
I presented to the Large Systems GSE meeting in Dudley on Wednesday on “DB2 Data Sharing Performance For Beginners” and an interesting question came up…
“At what distance does performance begin to suffer when moving machines in a parallel sysplex apart?”
As it happens I am dealing with exactly that question with a customer right now. It seems it’s a popular thing to attempt. (I know: One user group question and one customer engagement do not a trend make.) 🙂
So I obviously need another foil or two in my presentation – as if it wasn’t long enough already. 🙂 Here are some early thoughts. (Actually you’re not the first set of guinea pigs – as I drafted something along these lines for my internal IBM blog.) 🙂
The “standard answer” is that performance begins to deteriorate above 10km. That’s fine – but I think I want a little more detail than that…
The question of disk performance at a distance was raised in the GSE discussion. It’s my view that, although the speed of light does come into play here, it’s less of an issue by far than for coupling facility (CF) requests: An elongation of 30 microseconds is far more serious for a CF request than for a disk I/O. (And there are probably far more CF requests a second than there are disk I/Os.) Ignoring protocol differences – such as the number of exchanges per request – disk is probably OK for one or two more orders of magnitude of distance. But don’t ignore disk considerations.
Now let me talk about CF requests…
A synchronous request to a coupling facility accessed using Internal Coupling (IC) links can complete in, say, 30 microseconds. That’s because the link is rather more logical than physical. Sync requests to close-at-hand CFs using Integrated Cluster Bus (ICB) links take longer than that. Sync requests to CFs using ISC links take longer still. ISC links allow much greater distances than ICB links (>> 7 metres). So merely using ISC links probably means a step down in request performance. And as the ISC link gets longer the speed of light comes into play. It’s also important to note that a CF request involves a number of signals up and down the link.
Coupling Facility Request Arithmetic
So here’s some easy (but probably inaccurate) maths:
Suppose each request requires 10 signals of some sort or other. And suppose the speed of light is 300,000 km/sec. Every kilometre of extra distance then requires 10km of extra signal travel distance, which takes about 33 microseconds (assuming signals propagate at the speed of light – which is the best case). So each request has roughly an additional 33 microseconds of service time per kilometre.
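For anyone who wants to play with the numbers, here is a minimal sketch of that arithmetic. The function name and the `propagation_fraction` parameter are my own inventions for illustration; the 10 signals per request and the speed of light are the same guesses as in the text.

```python
# Back-of-envelope sketch of the arithmetic above; every input is an assumption.
SPEED_OF_LIGHT_KM_PER_SEC = 300_000.0  # vacuum figure used in the text (best case)


def added_service_time_us(extra_km, signals_per_request=10, propagation_fraction=1.0):
    """Extra microseconds per CF request for extra_km of link distance.

    signals_per_request: assumed number of link traversals per request.
    propagation_fraction: signal speed as a fraction of c (1.0 is the best case).
    """
    signal_travel_km = extra_km * signals_per_request
    speed_km_per_sec = SPEED_OF_LIGHT_KM_PER_SEC * propagation_fraction
    return signal_travel_km / speed_km_per_sec * 1_000_000


for km in (1, 5, 10):
    print(f"{km:>2} km: +{added_service_time_us(km):.1f} microseconds per request")
```

With the defaults this works out at a little over 33 microseconds per kilometre, in the same ballpark as the figure above. Lowering `propagation_fraction` below 1.0 shows how quickly things get worse if signals travel slower than the best case.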
Further note that a synchronous request causes a z/OS processor engine to spin – so the time of a synchronous request also causes an equivalent CPU time “wastage”. z/OS, since Version 1 Release 2, has used an adaptive algorithm, converting sync requests into async requests as appropriate. This helps minimise the CPU cost of coupling on z/OS engines. And, as the distance increases, z/OS is more likely to convert your request to async.
Async requests don’t cause the z/OS processor to spin but they do take longer to complete. So longer distances may well have a knock-on effect on request response time.
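To make the spin cost concrete, here is a rough sketch. Please treat everything in it as hypothetical: the conversion threshold is a made-up number (not the value z/OS actually uses), the request rate and service times are invented, and the real algorithm is adaptive rather than a fixed cut-off.

```python
# Sketch only: the conversion threshold is a made-up illustrative number,
# not the value z/OS actually uses.
HYPOTHETICAL_ASYNC_THRESHOLD_US = 100.0


def sync_spin_cpu_fraction(sync_requests_per_sec, sync_service_time_us):
    """Fraction of one engine spent spinning on synchronous CF requests."""
    return sync_requests_per_sec * sync_service_time_us / 1_000_000


def likely_mode(expected_sync_time_us, threshold_us=HYPOTHETICAL_ASYNC_THRESHOLD_US):
    """Crude stand-in for the adaptive sync-to-async decision."""
    return "async" if expected_sync_time_us > threshold_us else "sync"


base_us = 30.0    # assumed short-distance sync service time
per_km_us = 33.3  # assumed extra time per km, from the arithmetic earlier
rate = 5_000      # assumed sync requests per second from one z/OS image

for km in (0, 1, 2, 5):
    service_us = base_us + km * per_km_us
    mode = likely_mode(service_us)
    # Async requests still cost some CPU, but they do not spin the engine,
    # so the spin cost is shown as zero once conversion kicks in.
    cpu = sync_spin_cpu_fraction(rate, service_us) if mode == "sync" else 0.0
    print(f"{km} km: {service_us:.0f} us, {mode}, spin cost {cpu:.2f} engines")
```

The point of the toy model is the shape rather than the numbers: spin cost rises in step with service time until requests are converted, after which the cost shows up as longer response times instead.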
(You can measure the times and the sync vs async request rates – at the structure / z/OS level – using the RMF Coupling Facility Report data (SMF 74-4).)
So the basic conclusion is that distance does matter, probably at distances well below 10km – at the request response time level.
Now let’s think about what that means for applications…
If the request is from a z/OS image to a proximate CF structure the response time is going to be lower than if it were to a remote CF. And the difference is obviously going to increase with distance. But the impact on an application (such as DB2, and its applications in turn) depends on the access rates and characteristics. An application that always uses a local CF structure will perform better than one that uses a mixture of local and remote accesses, which in turn will perform better than one with an “all remote” access pattern. And the fewer CF accesses per “transaction” the lower the impact.
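A small sketch of how the access mix and intensity feed through to a transaction. The function and all the figures (20 CF requests per transaction, 30 microseconds local, 200 microseconds remote) are assumptions picked purely for illustration.

```python
def cf_time_per_transaction_us(cf_requests, remote_fraction, local_us, remote_us):
    """Total CF service time for one transaction, given a local/remote mix."""
    remote_requests = cf_requests * remote_fraction
    local_requests = cf_requests - remote_requests
    return local_requests * local_us + remote_requests * remote_us


# Assumed figures: 20 CF requests per transaction, 30 us local, 200 us remote.
for label, fraction in (("all local", 0.0), ("half remote", 0.5), ("all remote", 1.0)):
    total = cf_time_per_transaction_us(20, fraction, local_us=30.0, remote_us=200.0)
    print(f"{label}: {total:.0f} microseconds of CF time per transaction")
```

Halving `cf_requests` halves every one of those totals, which is why access intensity is worth tuning even when you can't do anything about where the structures live.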
But you might not be able to choose “all local” access patterns. And you might not have much choice about access intensity – but the latter is a major DB2 Data Sharing tuning theme. So don’t discount that possibility. In any (DB2) case you can use DB2 Accounting Trace to monitor and tune DB2 applications’ use of CF resources.
CF Structure Duplexing
Let me conclude by talking about structure duplexing…
First we need to review how CF Duplexing works. (I’ll use the DB2 Data Sharing example to illustrate it.)
There are two copies of the structure – in separate coupling facilities, on separate machines.
In the System-Managed case, for every request both structures have to perform the same amount of work, and the two CFs coordinate via a dedicated link. The request’s completion is signalled only when the request has been processed in both CFs.
One request is always to a remote CF. The other may or may not be. So in all cases a request is performed at effectively “remote speed”. Which, as I said, elongates with distance.
DB2 Locking (via the LOCK1 structure) and GRS Star are good examples of structures affected by this.
User-Managed duplexing has only one exploiter: DB2 Group Buffer Pools (GBPs).
In the User-Managed case only the writes are processed by both copies of the structure… An async write to the secondary is followed by a sync write to the primary. When both have completed the request is signalled as having been completed. So writes always go at remote speed.
Reads are always from the primary. At an individual z/OS image level this could be remote or local depending on which machine the z/OS image is on and which machine the primary structure is on.
So the effects here are perhaps less severe. “Perhaps” because it all depends on access rates and patterns. For a read-only subsystem that is local to all the primary GBPs, all requests are going to be local. For another read-only subsystem accessing the GBPs remotely the CF response times will be higher. (So perhaps balancing the GBPs across the 2 CFs might help keep the application response times consistent.)
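Here is a toy model of the User-Managed case as described above: reads go to the primary only, and a write is treated as taking as long as the slower of the two copies. That last bit is a simplifying assumption of mine, as are the function name and the service times.

```python
def duplexed_gbp_avg_us(read_fraction, primary_us, secondary_us):
    """Average GBP request time for one z/OS image under user-managed duplexing.

    Reads go to the primary only; a write completes when both copies are done,
    modelled here as the slower of the two (an assumption that ignores overlap
    details and any async overhead).
    """
    write_us = max(primary_us, secondary_us)
    return read_fraction * primary_us + (1.0 - read_fraction) * write_us


LOCAL_US, REMOTE_US = 30.0, 200.0  # assumed service times

# Image on the same machine as the primary vs. on the other machine.
for label, primary_us, secondary_us in (
    ("primary local", LOCAL_US, REMOTE_US),
    ("primary remote", REMOTE_US, LOCAL_US),
):
    for read_fraction in (1.0, 0.8):
        avg = duplexed_gbp_avg_us(read_fraction, primary_us, secondary_us)
        print(f"{label}, {read_fraction:.0%} reads: {avg:.0f} microseconds")
```

In this model the image that is local to the primary only pays the remote price on writes, whereas the image that is remote from the primary pays it on everything. That asymmetry is the argument for spreading the primaries across the two CFs.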
The bottom line with duplexing is that it stands to increase CF response times and sensitivity to distance. But equally you can tune DB2 usage, perhaps reducing GBP and locking traffic.
I’ve deliberately written this in a “design in public” style as I’m seeking early customer experiences and perhaps corrections to my thinking. I suppose that’s one of the things blogs are good for.