(Originally posted 2008-06-10.)
I had to eat a little bit of humble pie today – because I made an elementary mistake. I’m going to share it with you – to save you making it, too. 🙂 And I’m very sure I’m not going to make it in a customer situation. (Residencies are great for making mistakes in a safe environment.) 🙂
In CFLEVEL 15 and z/OS Release 9 RMF you get a new field at the structure level: R744SETM.
This is the CPU used in the Coupling Facility to process requests to an individual structure. When you add the individual R744SETM values – for all the structures in the Coupling Facility – you get the same value as the sum of the R744PBSY times for all the processors. This is because R744PBSY is defined as the time spent processing requests. (You get this by individual coupling facility processor.) If you were to calculate a capture ratio using R744SETM and R744PBSY you’d always get 100% – so it’s probably not worth the bother. 🙂
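If you want to see that identity for yourself, here is a minimal sketch. The dictionary and list are made-up stand-ins for values you would already have extracted from the SMF 74-4 records, and I'm assuming both fields have been converted to the same unit (microseconds here).

```python
import math

def capture_ratio_pct(setm_by_structure, pbsy_by_processor):
    """Return 100 * (sum of structure CPU) / (sum of processor busy time)."""
    structure_cpu = sum(setm_by_structure.values())
    processor_busy = sum(pbsy_by_processor)
    return 100.0 * structure_cpu / processor_busy

# Hypothetical values, in microseconds, for one interval.
setm_by_structure = {"STRUCTURE_A": 250_000.0, "STRUCTURE_B": 550_000.0}
pbsy_by_processor = [300_000, 500_000]

ratio = capture_ratio_pct(setm_by_structure, pbsy_by_processor)
# R744SETM is floating point, so compare with a tolerance rather than ==.
print(math.isclose(ratio, 100.0))
```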
Normally one calculates Coupling Facility busy using the formula 100*R744PBSY / (R744PBSY + R744PWAI), with the busy and wait times each summed over the processors. I’m beginning to think that 100*R744PBSY / SMF74INT is a more useful calculation (and it collapses down to the usual formula for dedicated processors).
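Here is a rough sketch of the two calculations side by side. The function names and sample numbers are mine, not RMF's. I'm assuming the busy and wait times and the interval length have all been converted to microseconds first, and that for more than one processor the interval-based version divides by the interval multiplied by the number of processors.

```python
def cf_busy_pct(samples):
    """Usual formula: 100 * sum(busy) / sum(busy + wait) over the processors.

    samples is a list of (R744PBSY, R744PWAI) pairs, one per CF processor.
    """
    busy = sum(pbsy for pbsy, _ in samples)
    total = sum(pbsy + pwai for pbsy, pwai in samples)
    return 100.0 * busy / total

def cf_busy_pct_by_interval(samples, interval_us):
    """Alternative: busy time as a percentage of the elapsed interval.

    Dividing by interval_us * number of processors is an assumption of
    this sketch for the multi-processor case.
    """
    busy = sum(pbsy for pbsy, _ in samples)
    return 100.0 * busy / (interval_us * len(samples))

# Hypothetical dedicated processors: busy + wait fills the whole interval.
interval_us = 1_000_000
dedicated = [(300_000, 700_000), (500_000, 500_000)]

usual = cf_busy_pct(dedicated)
by_interval = cf_busy_pct_by_interval(dedicated, interval_us)
print(usual, by_interval)
```

For these dedicated processors both functions give the same answer. They only diverge when busy plus wait adds up to less than the interval, which is what you would expect with shared processors.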
So now to my mistake…
I wanted to compare the CPU time by structure to request service times. You can do this if you compare R744SETM to R744SSTM+R744SATM. (SSTM is the sum of Sync service times and SATM the same but for Async.) So I did this comparison and carefully noted that R744SETM is an 8-byte floating point number and the other two are 8-byte integers but that all three are in microseconds over the interval. What I failed to realise is that you need to add up the service times over all the z/OS systems connecting to the structure…
R744SETM is for all requests to the CF structure. The others are by z/OS system.
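To make the fix concrete, here is a small sketch of the comparison done properly. The row layout, system names and numbers are all invented for illustration; the only point is that the per-system service times get summed across every connected system before being set against the structure-level CPU time.

```python
from collections import defaultdict

def total_service_time(per_system_rows):
    """Sum R744SSTM + R744SATM over all systems, keyed by structure name."""
    totals = defaultdict(int)
    for row in per_system_rows:
        totals[row["structure"]] += row["sstm"] + row["satm"]
    return dict(totals)

# Hypothetical per-system values in microseconds for one interval.
rows = [
    {"system": "SYSA", "structure": "LOCK1", "sstm": 400_000, "satm": 100_000},
    {"system": "SYSB", "structure": "LOCK1", "sstm": 300_000, "satm": 50_000},
]
# Hypothetical structure-level CPU time (R744SETM), already for all systems.
setm = {"LOCK1": 600_000.0}

service = total_service_time(rows)
for name, cpu_us in setm.items():
    # Report whether CPU time fits within the summed service time.
    print(name, cpu_us, service[name], cpu_us <= service[name])
```

With these numbers SYSA on its own shows 500,000 microseconds of service time against 600,000 of CPU, which is exactly the puzzle I gave myself. Add SYSB in and the service time becomes 850,000.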
Why does this matter? Because for several fraught hours I suffered the embarrassment 🙂 of not knowing why the CPU time was longer than the elapsed time. 🙂
There are in fact a couple of cases where CPU time isn’t included in the service time. I intend to write those up in the book. But basically they are ones where z/OS has been given the “request complete” signal but there is still some processing to do. In these cases we continue to accumulate CPU but not service time. (And, yes, these cases are logically OK.)
So, I don’t feel too bad about my mistake, particularly as it led to some learning. And I look forward to making the same sorts of mistakes with Coupling Facility Cache Statistics – which are also collected at the “all systems” level. 🙂