(Originally posted 2018-05-27.)
I must be mad to do what I do… 🙂
Specifically, I’m talking about maintaining code to process SMF records, rather than relying on somebody else to do it.
Recently I sat down with someone who is getting started with processing SMF. In fact they’re building sample reporting to go against a new piece of technology that already maps the records. My contribution was to help them get started, given my knowledge of the data.
Our infrastructure consists of two main parts:
- Building databases
- Reporting against those databases
This post is almost entirely about the former. In passing, I would note that any SMF processing tool has to sit in a framework where the queries you run can turn into useful work products – such as tabular reports, graphs, and diagrams.1 In our case our reporting is essentially a mixture of REXX and the GDDM Presentation Graphics Facility (PGF), both of which integrate well with the query engine.
An Example: SMF 70 LPAR Data
SMF 70 Subtype 1 is the record that contains system- and machine-level CPU information. It’s a pretty complicated2 but very valuable record.
So let me describe to you how to get LPAR-level CPU, which is a fundamental report – and the one I showed my colleague how to do.
SMF 70–1 has a section called the “Logical Processor Data Section”. In a record you get one of these for each logical engine configured to each LPAR on the machine.3 You also get one for each physical processor – in the pseudo-LPAR “PHYSICAL”.
Essential fields are the Virtual Processor Address (VPA), Physical Dispatch Time (PDT), and Effective Dispatch Time (EDT). PDT is EDT plus some PR/SM cost. Depending on your needs, you probably want PDT when calculating how busy the logical processor is. You probably also want to know which processor pool it is in (CIX), and how much of the RMF interval the processor was online for (ONT).
Most customers define some offline engines for each LPAR – so actually ONT is important.
For a modern z/OS LPAR, GCPs have CIX=1 and zIIPs have CIX=6.
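To make the arithmetic concrete, here is a minimal sketch in Python (our real code is Assembler); the structure and field names are my simplification, not the actual SMF70xxx labels, and I assume the times have already been converted to seconds:

```python
from dataclasses import dataclass

# Hypothetical, simplified view of a Logical Processor Data Section.
# The real record uses SMF70xxx field names and raw time formats.
@dataclass
class LogicalProcessor:
    vpa: int        # Virtual Processor Address
    cix: int        # processor pool index: 1 = GCP, 6 = zIIP on current z/OS
    pdt_sec: float  # Physical Dispatch Time, in seconds
    edt_sec: float  # Effective Dispatch Time, in seconds
    ont_sec: float  # time online during the RMF interval, in seconds

def logical_busy_pct(lp: LogicalProcessor) -> float:
    """Busy % of the time the logical processor was actually online.

    Using PDT (rather than EDT) includes the PR/SM cost of running
    this logical processor. An engine offline all interval is 0% busy.
    """
    if lp.ont_sec == 0:
        return 0.0
    return 100.0 * lp.pdt_sec / lp.ont_sec
```

For example, a GCP online for the whole of a 900-second interval and dispatched for 450 of those seconds comes out at 50% busy. Dividing by ONT rather than the interval length is what makes the offline-engine case come out right.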
This is really valuable information, but there’s a catch: The Logical Processor Data Section doesn’t tell you which LPAR each section is from. So you have to get that from elsewhere:
For each LPAR on the machine there is a Logical Partition Data Section. As well as the name, there is a field that tells you which Logical Processor Data Section is the first one for this LPAR, and another that tells you how many there are.4 So we use this to put LPAR-level information into a virtual record we build from the Logical Processor Data Section.
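The “first section plus count” lookup can be sketched like this (again in Python, with made-up names rather than the real SMF70xxx field labels; I’m assuming the index field is 1-based):

```python
from dataclasses import dataclass

# Hypothetical, simplified view of a Logical Partition Data Section.
@dataclass
class PartitionSection:
    name: str         # LPAR name (or the pseudo-LPAR "PHYSICAL")
    first_index: int  # 1-based index of this LPAR's first
                      # Logical Processor Data Section in the record
    count: int        # how many Logical Processor Data Sections it owns

def lpar_for_each_processor(partitions, n_processor_sections):
    """Map each Logical Processor Data Section (by position in the
    record) to the name of the LPAR that owns it."""
    owner = [None] * n_processor_sections
    for p in partitions:
        start = p.first_index - 1          # convert to 0-based
        for i in range(start, start + p.count):
            owner[i] = p.name
    return owner
```

A deactivated LPAR, with a count of zero, simply contributes no Logical Processor Data Sections and drops out of the loop. This owner list is effectively what we fold into the virtual record we build.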
We store information summarised at the LPAR level, as well as at the individual logical processor level. Importantly, the LPAR-level summary is at the pool level, rather than conflating GCPs and zIIPs.
PDT, EDT, and ONT are times. To convert these to percentages of an interval I need to extract the interval length (INT). But this is in neither the Logical Processor Data Section nor the Logical Partition Data Section; it is in the RMF Product Section, so I need to extract it from there. And INT is a time in a very different format from the others, so some conversion is necessary.
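As a sketch of that conversion – assuming INT is packed decimal in the form mmsstttF (minutes, seconds, thousandths of a second, then a sign nibble) and the dispatch and online times are in TOD-clock-like units where bit 51 represents one microsecond; verify both assumptions against the current SMF manual:

```python
def interval_seconds(packed: bytes) -> float:
    """Convert an assumed 4-byte packed-decimal mmsstttF interval
    to seconds: two digits of minutes, two of seconds, three of
    thousandths, one sign nibble."""
    digits = packed.hex()            # e.g. '1500000f' for 15 minutes
    mm = int(digits[0:2])
    ss = int(digits[2:4])
    ttt = int(digits[4:7])
    return mm * 60 + ss + ttt / 1000.0

def tod_to_seconds(tod: int) -> float:
    """Convert an assumed TOD-format time to seconds
    (bit 51 = 1 microsecond, so shift off the low 12 bits)."""
    return (tod >> 12) / 1_000_000.0

# A processor dispatched for 450 s of a 15-minute interval
# is 50% busy relative to the interval:
interval = interval_seconds(bytes.fromhex('1500000f'))
busy_pct = 100.0 * 450.0 / interval
```

Once everything is in one unit (seconds here), the percentages are trivial; the fiddly part is only the unpacking.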
I’ve grossly simplified what’s in the relevant sections – restricting myself to the one problem area: LPAR Busy.
We do our data mangling as we build our database.
Our code, by the way, is Assembler. I inherited it about 15 years ago and have extensively modified it as the CPU data model has evolved. If the above sounds complicated I’d agree with you. Register management is a real issue for us. Perhaps I should convert to baseless. But of course that carries a major risk of breakage.
So that’s our design: “Mangle and store”. But that’s not the only design: You could do the mangling every time you run the query. I would suggest that can lead to a maintenance burden and a risk of loss of consistency. So, if you were doing this in SQL you’d probably want a view.
And if you did mangle the data every time you run such a query you’d want to be careful about performance. Fortunately, SMF 70 isn’t a high volume record. So maybe performance isn’t critical.
Of course with SQL you could mangle and store, querying the resulting table. That’s probably the best design for any new SMF processor. By the way, I’ve no idea what other SMF processing tools do.
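A toy illustration of “mangle and store” in SQLite (the table, view, and column names are mine, and the numbers are invented): the mangled per-processor rows are stored once, and a view keeps the summarisation logic in a single place, so every query sees consistent numbers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- mangled, per-logical-processor rows: "mangle and store"
    CREATE TABLE smf70_prc (
        lpar     TEXT,
        cix      INTEGER,   -- 1 = GCP, 6 = zIIP
        pdt_sec  REAL,
        ont_sec  REAL,
        int_sec  REAL
    );
    -- the summarisation lives in exactly one place: a view
    CREATE VIEW lpar_pool_busy AS
        SELECT lpar, cix,
               100.0 * SUM(pdt_sec) / SUM(ont_sec) AS busy_pct
        FROM smf70_prc
        WHERE ont_sec > 0
        GROUP BY lpar, cix;
""")
con.executemany(
    "INSERT INTO smf70_prc VALUES (?, ?, ?, ?, ?)",
    [("PROD", 1, 450.0, 900.0, 900.0),
     ("PROD", 1, 225.0, 900.0, 900.0),
     ("PROD", 6,  90.0, 900.0, 900.0)],
)
for row in con.execute("SELECT * FROM lpar_pool_busy ORDER BY cix"):
    print(row)   # one busy % per (LPAR, pool)
```

Note that keeping GCPs and zIIPs apart falls out naturally from grouping by CIX – no extra code in any individual query.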
And do you throw away the raw, unprocessed data tables? I would say not; we don’t.
Parting Shorts 🙂
There are lots of people out there, including several from vendors, who claim to be able to process SMF data. If all they do is map the records, be very careful as you’ll have to do the hard work yourself.5
In any case, there is mapping and there is mapping right. I’m not sure how you’d know the difference – until you installed it and tried to use it.
I would claim – for the reporting and SMF processing I maintain6 – my experience of the actual data is invaluable. And this experience transfer session proved to me that even beginning to replicate what the RMF Postprocessor produces takes all that experience. Anyone who just knows RMF reports will have a hard time with the raw data.
Beyond the mapping is, of course, reporting. Sample reports are vital here – and well-documented ones at that. And a thriving community of people using and sharing comes a close second.
I think my contribution to “community” is through blog posts like this one, podcasting and presenting at conferences7. Little of that would be possible without my real life experience of mangling 🙂 data. And, no, I don’t think I’m giving anything regrettable away by telling you how I process data.
In short, I must be mad to maintain our code – until you consider the alternative: Blissful ignorance. Blissful right up to the point where real work has to get done. 🙂
So, when evaluating products or designing your SMF handling regime, consider what you can do with the results of any queries. ↩
But not the worst. ↩
Pick one z/OS system to report on the whole machine – or you’ll get duplication. ↩
For a deactivated LPAR the count is zero – and we do indeed list these LPARs when reporting on machines. They’re interesting from the standpoint of LPAR recovery. ↩
Mapping the record really isn’t the hardest bit. ↩
And my friend and colleague Dave Betten also maintains a lot of our code with the same skill level. ↩
Which reminds me I ought to upload the slides from my presentations last week at Z Technical University in London. ↩