This post is about gleaning start and stop information from SMF – which, to some extent, is not a conventional purpose.
But why do we care about when IPLs happen? Likewise middleware starts and stops? Or any other starts and stops?
I think, if you’ll pardon the pun, we should stop and think about this.
Reasons Why We Care
There are a number of reasons why we might care. Ones that come immediately to mind are:
- Early Life behaviours
- System Recovery Boost and Recovery Process Boost
- PR/SM changes such as HiperDispatch Effects
- Architectural context
There will, of course, be reasons I haven’t thought of. But these are enough for now.
So let’s examine each of these a little.
Early Life Behaviours
Take the example of a Db2 subsystem starting up.
At the very least its buffer pools are unpopulated and there are no threads to reuse. Over time the buffer pools will populate and settle down. Likewise the thread population will mature. When I’ve plotted storage usage by a “Db2 Engine” service class I’ve observed it growing, with the growth tapering off and the overall usage settling down. This usually takes days, and sometimes weeks.
(Parenthetically, how do you tell the difference between a memory leak and an address space’s maturation? It helps to know if the address space should be mature.)
Suppose we didn’t know we were in the “settling down” phase of a Db2 subsystem’s life. Examining the performance data, such as the buffer pool effectiveness, we might draw the wrong conclusions.
Conversely, take the example of a z/OS system that has been up for months. There is a thing called “a therapeutic IPL”. Though z/OS is very good at staying up and performing well for a very long time, an occasional IPL might be helpful.
I’d like to know if an IPL was “fresh” or if the z/OS LPAR had been up for months. This is probably less critical than the “early life of a Db2” case, though.
System Recovery Boost and Recovery Process Boost
With System Recovery Boost and Recovery Process Boost, resource availability and consumption can change dramatically – at least for a short period of time.
In SRB And SMF I talked about early experience and sources of data for SRB. As I said I probably would, I’ve learnt a little more since then.
One thing I’ve observed is that if another z/OS system in the sysplex IPLs it can cause the other systems in the sysplex to experience a boost. I’ve seen time correlation of this effect. I can “hand wave” it as something like a recovery process when a z/OS system leaves a sysplex – or perhaps as a Db2 Data Sharing member disconnecting from its structures.
Quite apart from catering for boosts, detecting and explaining them seems to me to be important. If you can detect systems IPLing that helps with the explanation.
PR/SM Changes Such As HiperDispatch Effects
Suppose an LPAR is deactivated. It might only be a test LPAR – in fact that’s one of the most likely cases. It can affect the way PR/SM behaves with HiperDispatch. (Actually that was true even before HiperDispatch.) Let me take an example:
- The pool has 10 CPs.
- LPAR A has weight 100 – 1 CP’s worth.
- LPAR B has weight 200 – 2 CP’s worth.
- LPAR C has weight 700 – 7 CP’s worth.
All 3 LPARs are activated and each CP’s worth of weight is 100 (1000 ÷ 10).
Now suppose LPAR B is deactivated. The total pool’s weight is now 800, so each CP’s worth of weight is now 80 (800 ÷ 10). So LPAR A’s weight is 1.25 CP’s worth and LPAR C’s is 8.75 CP’s worth.
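The weight arithmetic above can be sketched in a few lines of Python. The pool size and weights are the hypothetical values from the example, not data from any real configuration:

```python
def cps_worth(weight, total_weight, pool_cps):
    """Convert an LPAR's weight into its share of physical CPs."""
    return weight * pool_cps / total_weight

pool_cps = 10
weights = {"A": 100, "B": 200, "C": 700}

# All three LPARs activated: total weight 1000, so 100 per CP's worth.
total = sum(weights.values())
for lpar, w in weights.items():
    print(lpar, cps_worth(w, total, pool_cps))  # A 1.0, B 2.0, C 7.0

# Deactivate LPAR B: total weight drops to 800, so 80 per CP's worth.
del weights["B"]
total = sum(weights.values())
for lpar, w in weights.items():
    print(lpar, cps_worth(w, total, pool_cps))  # A 1.25, C 8.75
```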
Clearly HiperDispatch will assign Vertical High (VH), Vertical Medium (VM), and Vertical Low (VL) logical processors differently. In fact probably to the benefit of LPARs A and C – as maybe some VL’s become VM’s and maybe some VM’s become VH’s.
The point is PR/SM behaviour will change. So activation and deactivation of LPARs is worth detecting – if you want to understand CPU and PR/SM behaviour.
(Memory, on the other hand, doesn’t behave this way: Deactivate an LPAR and the memory isn’t reassigned to the remaining LPARs.)
Architectural Context
For a long time now – if a customer sends us SMF 30 records – we’ve been able to see when CICS or IMS regions start and stop.
Architecturally (or maybe operationally) it matters whether a CICS region stops nightly, weekly, or only at IPL time. Most customers have a preference (many a strong preference) for not bringing CICS regions down each night. However, quite a few still have to. For some it’s allowing the Batch to run, for a few it’s so the CICS regions can pick up new versions of files.
Less important, but equally interesting architecturally, is the idea that middleware components that start and stop together are probably related – whether as clones, parts of the same technical mesh, or business-wise similar.
How To Detect Starts And Stops
In the above examples, some cases are LPAR (or z/OS system) level. Others are at the address space or subsystem level.
So let’s see how we can detect these starts and stops at each level.
At the system level the best source of information is RMF SMF Type 70 Subtype 1.
For some time now 70-1 has given the IPL date and time for the record-cutting system (field SMF70_IPL_TIME, which is in UTC). As I pointed out in SRB And SMF, you can see whether this IPL (and the preceding shutdown) was boosted by SRB.
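As a sketch, once the 70-1 records have been reduced to (system, IPL time) pairs – the record parsing itself is beyond this post, and the tuple shape is mine, not anything SMF-defined – spotting IPLs is just a matter of collecting the distinct SMF70_IPL_TIME values per system:

```python
def ipls_by_system(records):
    """records: iterable of (system_name, ipl_time) pairs, one per
    70-1 record. Each distinct IPL time for a system marks one IPL."""
    seen = {}
    for system, ipl_time in records:
        seen.setdefault(system, set()).add(ipl_time)
    return {system: sorted(times) for system, times in seen.items()}
```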
LPAR Activation and Deactivation can also, usually, be detected in 70-1. 70-1’s Logical Processor Data Section tells you, among other things, how many logical processors each LPAR has. If that count transitions from zero to more than zero the LPAR has been activated; if it transitions from more than zero to zero it has been deactivated. The word “usually” covers the case where the LPAR is deactivated and then re-activated within a single RMF interval. If that happens my code, at least, won’t notice the bounce. This isn’t, of course, the same as an IPL – where the LPAR remains activated throughout.
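The zero-crossing test might look like this in outline – assuming the Logical Processor Data Sections have already been reduced to a per-interval logical processor count for each LPAR (again, the tuple shape is my own, not an SMF layout):

```python
def detect_transitions(intervals):
    """intervals: (timestamp, lpar, logical_cp_count) tuples, sorted
    by timestamp. Yields (timestamp, lpar, event) for each transition."""
    last = {}  # lpar -> logical CP count in the previous interval
    for ts, lpar, count in intervals:
        prev = last.get(lpar)
        if prev is not None:
            if prev == 0 and count > 0:
                yield (ts, lpar, "activated")
            elif prev > 0 and count == 0:
                yield (ts, lpar, "deactivated")
        last[lpar] = count
```

As noted above, a deactivate-and-reactivate inside a single RMF interval slips through this net.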
The above reinforces my view that you really want RMF SMF data from all the z/OS systems in your estate, even the tiny ones. That way you’ll see the SMF70_IPL_TIME values for them all.
When I say “Subsystem Level” I’m really talking about address spaces. For that I would look to SMF 30.
But before I deal with subsystems I should note an alternative way of detecting IPLs: Reader Start Time in SMF 30 for the Master Scheduler Address Space is within seconds of an IPL. Close enough, I think. This is actually the method I used in code written before the 70-1 field became available.
For an address space you can generally use its Reader Start Time to mark it coming up. (Being ready for work could be a little later – as is also true for IPLs – and SMF won’t tell you when that is. Likewise for shutting down.) You could also use the Step-End and Job-End timestamps in SMF 30 Subtypes 4 and 5 for when the address space comes down. In practice I use interval records and ask of the data “is the address space still up?” until I see the final interval record for the address space instance.
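The “is it still up?” approach might be sketched as follows. The field names here (job, reader_start, interval_end) are illustrative stand-ins for the corresponding SMF 30 fields after parsing – an address space instance is keyed by job name plus Reader Start Time, and its last interval record approximates when it came down:

```python
def lifetimes(interval_records):
    """Return {(job, reader_start): last_interval_end} for each
    address space instance seen in SMF 30 interval records."""
    last_seen = {}
    for rec in interval_records:
        key = (rec["job"], rec["reader_start"])
        end = rec["interval_end"]
        if key not in last_seen or end > last_seen[key]:
            last_seen[key] = end
    return last_seen
```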
When it comes to reporting on address space up and down times I group them by ones with the same start and stop times. That way I see the correlated or cloned address spaces. This is true for both similar address spaces (eg CICS regions) and dissimilar (such as adding Db2 subsystems into the mix).
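That grouping is simple to express; a minimal sketch (the instance triples are a hypothetical reduced form, not an SMF layout):

```python
from collections import defaultdict

def group_by_lifetime(instances):
    """instances: iterable of (name, start, stop) triples.
    Returns {(start, stop): [names]} for groups of two or more,
    i.e. the address spaces that came up and went down together."""
    groups = defaultdict(list)
    for name, start, stop in instances:
        groups[(start, stop)].append(name)
    return {times: sorted(names)
            for times, names in groups.items() if len(names) > 1}
```

Because the key is just the (start, stop) pair, dissimilar address spaces – say a Db2 subsystem bounced alongside its CICS regions – fall into the same group as readily as clones do.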
As I hope I’ve shown you, there are lots of architectural and performance reasons why beginnings and endings are important to detect. I would say it’s not just about observation; it could be a basis for improvement.
As I also hope I’ve demonstrated, SMF documents such starts and stops very nicely – if you interrogate the data right. And a lot of my coding effort recently has been in spotting such changes and reporting them. If I work with you(r data) expect me to be discussing this. For all the above reasons.