(Originally posted 2014-05-02.)
If you have a large mainframe estate it can be difficult to keep track of when the various moving parts start and stop. For example, if you’re a Performance person it’s quite likely nobody bothered to tell you when the systems were IPL’ed. You might well know what the regime for starting and stopping CICS is but I wouldn’t.
As you know, I’m curious about how customers run their installations, and how pieces of infrastructure get started (and stopped) particularly interests me. I’m also impressed when a piece of infrastructure has been up for years – as sometimes happens. Up until now it’s been a matter of folklore, such as “the installation that didn’t take an application down for 10 years”.[1]
But I’ve turned my attention to when z/OS is IPL’ed and when key address spaces start and stop. I’m sharing the technique in case it’s something you want to do.
I’m also interested in the sequence and timing between a z/OS system’s IPL and when important subsystems are up.[2]
I’m not going to pretend to be an expert in how systems are restarted or recovered but I am going to take an interest. Knowing what’s “normal” is, I think, useful.
Simple Instrumentation
You probably know that SMF 30 subtypes 4 and 5 describe steps and jobs, respectively. You probably also know SMF 30 subtypes 2 and 3 are interval records.
If you’re already collecting these you’re in good shape as Reader Start Time is in all of these. It’s all you need to figure out when stuff starts.[3]
I prefer the interval records because:

- Most customers send me SMF 30 interval records. (I get the others for batch studies.)
- You can get the Reader Start Time from these even when the address space is still up. (When the Reader Start Time changes for an address space I know it’s restarted – see the sketch after this list.)
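To make that restart-detection idea concrete, here’s a minimal sketch. It assumes the SMF 30 subtype 2 and 3 records have already been reduced – by some earlier extraction step of your own – to a CSV of (system, jobname, program, reader_start) rows in time order; the column names and the CSV layout are purely illustrative, not anything SMF gives you directly.

```python
import csv

def detect_restarts(csv_path):
    """Collect a start event each time a job's Reader Start Time changes.

    Assumes a CSV (from a prior SMF 30 extraction step) with columns
    system, jobname, program, reader_start -- reader_start being an
    ISO-8601 timestamp string -- and that rows arrive in interval order.
    All of these names are illustrative assumptions.
    """
    last_start = {}   # (system, jobname) -> last Reader Start Time seen
    events = []       # (reader_start, system, jobname, program)

    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["system"], row["jobname"])
            start = row["reader_start"]
            if last_start.get(key) != start:
                # A new Reader Start Time means the address space (re)started
                events.append((start, row["system"], row["jobname"], row["program"]))
                last_start[key] = start
    return sorted(events)   # chronological: ISO-8601 strings sort correctly


if __name__ == "__main__":
    for start, system, jobname, program in detect_restarts("smf30_intervals.csv"):
        print(start[:16], system, jobname)   # date, hour and minute only
```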
Summarisation And Reporting[4]
For some address space types I report each job name separately; CICS regions are a good example. For others I pick just one address space per subsystem; DB2 and MQ subsystems are good examples.
To detect an IPL I choose the address space whose program is IEEMB860. In principle the job name could vary. And yes I know that “pressing the button” on IPL invokes NIP etc before this (the Master Scheduler) address space starts up.
I only print date, hour and minute for Reader Start Time. It goes to hundredths of seconds but I’m not interested in that level of detail.[5]
In my report I sequence by timestamp. That makes it easier to see when an IPL is followed by, say, a DB2 start and then some CICS regions. I could probably create a useful Gantt chart from this, but today I don’t. The technology’s there to make this easy to do.
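As a sketch of that reporting logic, the fragment below flags IEEMB860 starts as IPLs, keeps a single entry per DB2 or MQ subsystem, and prints everything in timestamp order to the minute. It works on the event tuples from the earlier sketch; the “keep only the xxxxMSTR address space” rule is just one illustrative way of picking one address space per subsystem, not a claim about how my actual reporting does it.

```python
def report(events):
    """Print start events in timestamp order, flagging IPLs.

    `events` is the (reader_start, system, jobname, program) list from the
    previous sketch. The DB2/MQ handling -- reporting only the xxxxMSTR
    address space and skipping its companions -- is an illustrative
    assumption, not the only way to summarise per subsystem.
    """
    for start, system, jobname, program in sorted(events):
        if program == "IEEMB860":
            label = "IPL (Master Scheduler start)"
        elif jobname.endswith(("DBM1", "DIST", "IRLM", "CHIN")):
            continue   # companion address spaces: report only the MSTR one
        elif jobname.endswith("MSTR"):
            label = f"Subsystem {jobname[:-4]} start"
        else:
            label = f"{jobname} start"   # e.g. an individual CICS region
        print(f"{start[:16]}  {system}  {label}")
```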
Conclusion
Looking at this data gives me a much better idea of how installations manage the lifecycles of their address spaces. If I talk to you about this topic it’ll probably be from this data and I might well refer you to this blog post. This is also one of the topics in the 2014 revision of my “Life And Times Of An Address Space” presentation.
Two final points:
- Reader Start Time doesn’t denote the time that a subsystem became available, so it’s not that good for application availability. You probably want to use the subsystem’s own instrumentation, such as logs, for that.[6]
- One of the merits of the Reader Start Time technique is that it’s very “light touch”.
Footnotes

1. I made that one up but it’s not unrepresentative.
2. I guess the readers of the System z Mean Time to Recovery Best Practices Redbook would be interested also.
3. Other start timestamps are available but this one does just fine.
4. I expect to evolve my reporting. I usually do.
5. People analysing IPLs probably are, or at least down to the second. And they’re probably interested in the differences between the various start timestamps. I could take an interest in the precise sequence in which “low Jobid” address spaces start up. Likewise the sequence in which, for example, clusters of CICS regions or DB2 address spaces start up. The data’s all there.
6. Or use what I call the “roaring silence” technique. A good example would be when the SMF 101 (DB2 Accounting Trace) record cutting rate drops to zero for a few minutes. That might denote a restart, with the subsystem being “back in business” once records start to be cut again.