Data Collection Requirements

(Originally posted 2013-06-01.)

Over the years I’ve written emails with data collection requirements dozens of times, with varying degrees of clarity. It would be better, wouldn’t it, to write it once. I don’t think I can get out of the business of writing such emails entirely, but here’s a goodly chunk of it.

Another thing that struck me is that the value of some types of data has increased enormously over time. So some data that might’ve been in the “I don’t mind if you don’t send it” category has been elevated to “things will be much better if you do”.

I’d like to articulate that more clearly.

Bare Minimum

I always want SMF Types 70 through to 79, whatever you have.

I put it this loosely because I have almost 100% success in getting the data I really need: the customer is almost always collecting it already. That has the distinct advantage of allowing me to ask for historical data, whether for longitudinal studies or just to capture a “problem” period.

There’s rarely been a study where I didn’t want RMF data. (Occasionally I’ve only wanted DB2 Accounting Trace.)

I like to have this data for a few days, but sometimes that isn’t possible. What really freaks my code out is having just a few intervals to work with.

Another question is “for which systems?” That’s a more difficult question to answer. Certainly I want all the major systems in the relevant sysplex. Ideally I’d have all the systems in the sysplex, and even all the systems on the machines involved. But that’s usually not realistic. You’ve probably guessed already that the nearer to that ideal, the better the insights are likely to be.
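
To make this concrete, here’s a minimal sketch of an IFASMFDP extraction job – the data set names, (Julian) dates and system IDs are placeholders for your own:

    //SMFDUMP  EXEC PGM=IFASMFDP
    //SYSPRINT DD SYSOUT=*
    //DUMPIN   DD DISP=SHR,DSN=YOUR.SMF.DAILY.DATA
    //DUMPOUT  DD DISP=(NEW,CATLG),DSN=YOUR.SMF.EXTRACT,
    //            UNIT=SYSDA,SPACE=(CYL,(500,100),RLSE)
    //SYSIN    DD *
      INDD(DUMPIN,OPTIONS(DUMP))
      OUTDD(DUMPOUT,TYPE(70:79))
      DATE(2013121,2013125)
      SID(SYSA)
      SID(SYSB)
    /*

TYPE selects the record types, DATE brackets the “problem” period, and one SID statement per system narrows the extract down. Five days from two systems gives plenty of intervals to work with.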

Strongly Enhancing

As you’ll’ve seen from this blog, Type 30 Subtypes 2 and 3 Interval records are of increasing value to me: I recently ran some 15-month-old data through the latest level of my code and was gratified at how much more insight I gained into how the customer’s systems worked.

A flavour of this is described in posts like Another Usage Of Usage Information.

So this data has definitely moved from the category of “I can just break down CPU usage a little further” to “it will make a large difference if you send it”.

Here the systems and time ranges I’d prefer to see data from can be much narrower: I probably don’t need to see this data from the Sysprogs’ sandpit.
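
If Type 30 interval records aren’t being cut, SMFPRMxx is where to look. A minimal sketch of the relevant fragment – the TYPE list is illustrative and your member will contain much more:

    SYS(TYPE(30,70:79),INTERVAL(001500),DETAIL)

INTERVAL(001500) asks for 15-minute interval recording, and DETAIL (as opposed to NODETAIL) keeps the full detail in the interval records; SET SMF=xx brings a changed member into effect without an IPL.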

Nice To Have

With most studies I can get by without the WLM Service Definition, but it helps in certain circumstances (as I mentioned in Playing Spot The Difference With WLM Service Definitions).

I’m OK with either the WLM ISPF TLIB or the XML version (as mentioned in that post).

If I want to take disk performance down to below the volume level, SMF 42 Subtype 6 Data Set Performance records are a must. It’s also the case that you can learn an awful lot about DB2 table space and index space fragments from 42-6, there being a well-known naming convention for DB2 data sets.
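
For reference, that naming convention looks like this – the catalog alias, database and space names below are made up:

    DB2CAT.DSNDBD.DBNAME01.TSNAME01.I0001.A001

DSNDBD marks the data component (DSNDBC the cluster), the next two qualifiers are the database and table space (or index space) names, and A001 numbers the partition or piece. So 42-6 records can be mapped back to DB2 objects with some confidence.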

Specialist Subjects

The above is common to most studies. The following deals with the more common specialist needs.

DB2

Most of the time I’m seeking to explain one of two things about DB2:

  • Where the CPU is going.
  • Where the time is going.

In both cases I need DB2 Accounting Trace (SMF 101). The quality of this data is variable. For example, to get CPU down to the Package (Program) level I need Trace Classes 7 and 8, in addition to the usual 1, 2 and 3. (Sometimes even 2 and 3 aren’t on.)

It’s quite likely this data isn’t being collected in its full glory all the time, so there’s maybe a 30% chance of getting it retrospectively.

Sometimes I’m keen to understand the DB2 subsystem, which is where Statistics Trace comes in. The default statistics interval (STATIME) used to be a horrendous 30 minutes. Now it’s much lower, so I’m pleased that issue has gone away. I ask for Trace Classes 1, 3, 4, 5, 6, 8 and 10, which result in SMF 100 and 102 records. (I don’t ask for Performance Trace, which also results in 102 records, albeit different ones.)
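
If the traces aren’t already on they can be started by command – a sketch, with “-DB2A” standing in for your subsystem’s recognition character:

    -DB2A START TRACE(ACCTG) CLASS(1,2,3,7,8) DEST(SMF)
    -DB2A START TRACE(STAT) CLASS(1,3,4,5,6,8,10) DEST(SMF)

In practice most installations set these permanently via the SMFACCT and SMFSTAT system parameters rather than starting them by hand.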

Again the questions of “for which subsystems?” and “when for?” come into play. That’s where negotiation is important:

  • It’s a lot of data to send.
  • Some installations deem it too expensive to collect on a continual basis.

I don’t disagree with either of those.

CICS

Here CICS Statistics (SMF 110) data is really useful – especially if you have a sensible statistics interval.

For application response time and CPU breakdown to the transaction level, CICS Monitoring (performance class) data is the thing. Again this is the sort of thing customers don’t keep on a regular basis – or for that many regions. It’s also quite prone to breakage: some customers remove fields from the record with a customised Monitor Control Table (MCT).
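
A sketch of the SIT parameters involved – treat the names and values as a starting point for your CICS level rather than gospel:

    MN=ON
    MNPER=ON
    STATRCD=ON
    STATINT=010000

MN=ON turns the monitoring facility on, MNPER=ON asks for performance class (transaction-level) data, STATRCD=ON writes interval statistics to SMF, and STATINT sets the statistics interval – one hour in this sketch.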

I try to glean what I can from SMF 30 about CICS – as numerous blog posts have pointed out – because I can get it for many more CICS regions than the CICS-specific data would furnish.

Batch

Batch is the area that takes in the widest range of data sources, the most fundamental of which are SMF 30 Subtypes 4 (Step-End) and 5 (Job-End).

I’ve already mentioned DB2 Accounting Trace and it’s most of what you need for understanding the timings of DB2 jobs.

For VSAM data sets SMF 62 records OPENs and 64 CLOSEs. For non-VSAM, SMF 14 covers input data sets (reads) and 15 output data sets (writes).
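
Pulling those record types together, a batch-oriented extract might use an OUTDD statement like this in the IFASMFDP job sketched earlier:

    OUTDD(DUMPOUT,TYPE(14,15,16,30,62,64))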

For DFSORT, SMF 16 is really handy – and even better if SMF=FULL is in effect. (Often it isn’t, but I generally wouldn’t stop data collection to fix that.)
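
If a particular job warranted it, a runtime override is one way to get SMF=FULL without touching the installation defaults – a sketch, assuming your installation permits DFSPARM overrides:

    //DFSPARM  DD *
      OPTION SMF=FULL
    /*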

MQ

I only occasionally look at WebSphere MQ, though I’d like to do much more with it. I don’t think many people are familiar with the data. Statistics Trace (analogous to DB2’s but different) is SMF 115. Accounting Trace – which deals with applications – is SMF 116.

If I want to see what connects to MQ the Usage Information in SMF 30 is generally enough – though it doesn’t tell me much about work coming in through the CHIN (Channel Initiator) address space. For that I really do need Accounting Trace. (An analogous point can be made about remote access to DB2.)
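
For completeness, a sketch of the commands that turn these traces on – “+MQ1A” is a hypothetical command prefix, and CLASS(3) for queue-level accounting detail is my assumption about what you’d want:

    +MQ1A START TRACE(STAT) DEST(SMF)
    +MQ1A START TRACE(ACCTG) CLASS(3) DEST(SMF)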

Getting Data To Me

This is usually OK but it’s worth reminding people of a few simple rules:

  1. Always use IFASMFDP (and / or IFASMFDL) to move SMF records around. (Using IEBGENER will probably break them, as it doesn’t handle their spanned records properly.)
  2. TERSE the data using AMATERSE (or perhaps TRSMAIN) – see the sketch after this list. This works fine for both SMF data and ISPF TLIBs (and I suppose partitioned data sets in general).
  3. FTP the data BINARY to ECUREP.
  4. Make sure you send the data to the right directory in ECUREP’s file store. The standard encoding of the PMR number helps a number of IBM systems (such as RETAIN) to work swiftly and effectively. I’ll give you a PMR number on my queue (UZPMOS) or we can use another one.
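
Here’s a minimal sketch of Rules 2 and 3 in JCL. The host name, target directory and PMR-encoded file name are illustrative – check the current ECUREP instructions before relying on them:

    //TERSE    EXEC PGM=AMATERSE,PARM=SPACK
    //SYSPRINT DD SYSOUT=*
    //SYSUT1   DD DISP=SHR,DSN=YOUR.SMF.EXTRACT
    //SYSUT2   DD DISP=(NEW,CATLG),DSN=YOUR.SMF.EXTRACT.TRS,
    //            UNIT=SYSDA,SPACE=(CYL,(200,50),RLSE)
    //*
    //SEND     EXEC PGM=FTP,PARM='ftp.ecurep.ibm.com (EXIT'
    //OUTPUT   DD SYSOUT=*
    //INPUT    DD *
    anonymous
    your.email@example.com
    cd /toibm/mvs
    binary
    put 'YOUR.SMF.EXTRACT.TRS' 12345.678.901.smf.trs
    quit
    /*

PARM=SPACK compresses harder than PACK at some extra CPU cost – usually worth it for SMF volumes.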

That seems like a lot of rules but most of it should be familiar to anyone who’s ever sent in documentation in support of a PMR. (Only Rule 1 is new.) If you have access to the PMR text – and quite a few customers do – it should also enable you to track the data inbound.

In Conclusion

Realistically people might not have all the data I want, and so there’s a process of negotiation, mainly trading timeliness and retrospectiveness against quality. Clarifying that trade-off would be helpful, which is why I like to run a Data Collection Kick-Off call. Ideally that call would be face to face, but I’m less insistent on that if distance makes it difficult.

At the end of the day I’m quite flexible and do what I can with whatever data I get. Of course you can’t magic missing data out of thin air, and can only occasionally repair it.

What I hope is that data collection is not overly burdensome and doesn’t cause stress to the customer. I also like to think that when they’ve sent the data in they can relax and more or less forget about it until “showtime”. 🙂

I also hope customers understand why they’ve been asked for the data they have. And that’s part of the point of this post, the rest being articulating what I need.
