Suffering Subsystems

(Originally posted 2016-02-14.)

I wish I’d started counting DB2 subsystems earlier.

A recent study saw 43 DB2 subsystems, in 13 Data Sharing groups (and a few in none), across a large number of z/OS systems.

Trying to recall other studies, these numbers seem fairly typical of them (though really there’s no such thing as a typical set of numbers).

Two thoughts entered my head:

  • How on earth do you get to these sorts of numbers, and is it a blessing or a nuisance?
  • How can you depict your DB2 estate?

This post is about the latter. I might come back to the former.

I want to share a technique I used that you might want to emulate. At any rate it generates diagrams I think you’ll find easy on the eye.

My Motivation

I’m always looking at new ways of depicting things for two reasons:

  • Because I spend way too long generating “orientation” information about customers. I’m lazy, or impatient, or an efficiency-seeker if you prefer. πŸ™‚
  • Because I think there are fresh insights to be had.

As I hinted, I think customer mainframe estates have become more and more complex. So the need for better tooling has become acute.

Source Material

To capture your DB2 estate you need, unsurprisingly, to use SMF 30 Interval records. I’ve written about this many times. But here are a couple of specifics:

  • I look for job names ending with “IRLM” to represent the DB2 subsystem.[1] This I plug into a query against the SMF 74–2 XCF data to retrieve the group name, throwing away any beginning with “IXCLO”. This gives me a “group name” which I can use to find others in the same group.[2]
  • To establish which CICS regions talk to which DB2 subsystem I use the DB2 SMF 30 Usage Data Section – for address spaces with program name “DFHSIP”.

If you read the footnotes you’ll see this isn’t 100% ideal but it certainly gets you a lot of the CICS / DB2 topology. To me it’s architecturally useful stuff. The question is how to depict this network.
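The matching logic is simple enough to sketch. Here’s an illustrative Python sketch, assuming the SMF 30 records have already been parsed into simple dicts – the field names here are my invention for illustration, not real SMF layouts:

```python
def find_db2_subsystems(smf30_records):
    """DB2 subsystems, spotted via job names ending in 'IRLM'."""
    return {r["jobname"] for r in smf30_records if r["jobname"].endswith("IRLM")}

def cics_to_db2(smf30_records):
    """Map each CICS region (program DFHSIP) to the DB2 subsystems
    named in its (pre-parsed) SMF 30 Usage Data Section."""
    topology = {}
    for r in smf30_records:
        if r["program"] == "DFHSIP":
            topology.setdefault(r["jobname"], set()).update(r.get("used_subsystems", []))
    return topology
```

In real life the parsing out of the SMF 30 sections is, of course, the bulk of the work; the above is only the shape of the join.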

New Tooling

I’ve used mind maps before and one of my favourite tools for creating and manipulating them is iThoughts. There’s an iOS version and a Mac OS X version.

Yes, other tools are available but there’s a specific feature I really like that makes this the tool I’m going with: Comma-Separated Value (CSV) import.[3]

CSV is nice because:

  • It’s plain text and my REXX code can readily generate it.
  • You can pull it into a spreadsheet and edit it before saving it and pulling it into iThoughts.
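To give a flavour, here’s a hypothetical Python sketch of generating such a CSV file. The exact column layout iThoughts expects is described in its documentation; the level / title / colour scheme below is an assumption for illustration only:

```python
import csv
import io

def topology_to_csv(groups):
    """groups: dict of Data Sharing group name -> list of member subsystems.
    A "(none)" group collects subsystems not in any Data Sharing group."""
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(["level", "title", "colour"])
    w.writerow([0, "DB2 Estate", ""])
    for group, members in sorted(groups.items()):
        w.writerow([1, group, ""])
        for m in members:
            # grey marks subsystems not in any Data Sharing group
            w.writerow([2, m, "grey" if group == "(none)" else ""])
    return buf.getvalue()
```

The nesting level drives the mind map hierarchy; the colour column is how the grey “not in a group” highlighting described later would be carried.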

One other iThoughts feature I like is the ability to Filter on a text string. You can actually do a Global Replace, which I found useful in sanitising the screenshots for this blog post.[4]

As with most mind mapping tools I can move nodes and subtrees around very easily. I can also add notes such as when the CICS region or DB2 subsystem started.

Some Fragments

So here are a couple of fragments of mind maps my tool has been taught to generate the CSV for. The screenshots are indeed from iThoughts running on Mac OS X.

First, a shot of some DB2 subsystems – one set in a Data Sharing group, another not.

The grey colour was actually specified in the CSV file my code creates. It’s to draw attention to the fact the subsystems in that colour aren’t in a DB2 Data Sharing group. One day I could colour code the Data Sharing groups.

And now a shot of some CICS regions attaching to two DB2 subsystems in the same system:

Conclusion

The two screenshots above are quite pretty and very close to automatic now:

  • My code generates the CSV file automatically
  • I still have to download it and throw it into iThoughts

That isn’t really burdensome.

The nice thing is I have a mind map or two I can rearrange and edit. And there are some more nice tricks like the ability to have my code generate notes for each node and have iThoughts import them at the same time as the actual topology data.

So if I get bored I can see ways to enhance this.

So, I’m sure you could do this with other mind mapping tools. The point of this post, however, is to encourage you to experiment with this kind of depiction. Have fun!


  1. The IRLM address space might not have the same characters before the “IRLM” as e.g. the DBM1 address space begins with.  ↩

  2. This is the IRLM XCF group, not the DB2 Data Sharing Group. The latter is not available unless you do something clever with SMF 74–4 Coupling Facility data. (And I haven’t got there yet.)  ↩

  3. Just as there are other Mind Map tools there are other text-file based formats, such as Freemind and OPML.  ↩

  4. It might interest you to know I’m using the Duet iOS app to provide a second screen and using iOS’ built-in screen shot capability to capture sections of the map.  ↩

DDF Batch

(Originally posted 2016-01-24.)

DDF and Batch sound like two opposite ends of the spectrum, don’t they?

Well, it turns out they’re not.

I said in DDF Counts I might well have more to say about DDF. I was right.

I’ve known for a long time that some DDF work can come in from other z/OS DB2 subsystems, but not really thought much about it.

Until now. And I don’t really know why now. 🙂 Maybe it’s just because I’m “in the neighbourhood”.

Why Is Batch DDF An Important Topic?

We look at batch jobs in lots of ways but until now we’ve not considered the case where a batch job goes to DB2 for data but the data is really in a different DB2.[1]

But if a DB2 job does go elsewhere for its data, the performance of that remote access clearly affects the job’s run time.

There are at least two different aspects to this:

  • The network traffic.
  • The remote DB2 access time.

How Do You Understand A Job’s Remote DB2 Performance?

First you have to detect an external DB2 batch job. Then you need to analyse its performance.

The latter is just the same as for any other DB2 batch job, so I won’t dwell on it here. Let’s consider, then, how to detect batch jobs that come in through DDF.

Detecting An External DB2 Batch Job

Let’s assume you have a bunch of SMF 101 (DB2 Accounting Trace) records with QWHCATYP of QWHCRUW or QWHCDUW – denoting DDF.

If field QMDAATYP contains “DSN” the DDF 101 record relates to a remote z/OS system. But these records could be, for example, from a remote CICS transaction.

You can detect remote batch jobs from the SMF 101 record by observing when field QMDACTYP contains “BATCH”. Typically QMDACNAM might contain “BATCH” or “DB2CALL”.

If it is Remote DB2 Batch the first eight characters of the remote Correlation ID (QMDACORR)[2] are the job name.

Obtaining the step number and name can be done by using timestamp analysis, comparing this record’s timestamps to SMF 30 for the job on its originating system.

One snag: the 101 record doesn’t actually tell you the originating system’s SMF ID. But it will give you some network information, from which you can probably work it out.
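Pulling the detection rules above together, here’s an illustrative Python sketch. The dicts stand in for parsed SMF 101 records, and treating QWHCATYP as the constant’s name is a simplification (in the real record it’s a numeric code):

```python
# Connection types denoting DDF (distributed / remote unit of work)
DDF_TYPES = {"QWHCRUW", "QWHCDUW"}

def remote_batch_jobname(rec):
    """Return the originating job name if this 101 record is remote
    z/OS batch coming in over DDF, else None."""
    if rec["QWHCATYP"] not in DDF_TYPES:
        return None                      # not DDF at all
    if not rec["QMDAATYP"].startswith("DSN"):
        return None                      # remote requester isn't z/OS DB2
    if "BATCH" not in rec["QMDACTYP"]:
        return None                      # remote task isn't batch
    return rec["QMDACORR"][:8].rstrip()  # first 8 chars of Correlation ID
```

The timestamp matching against SMF 30 on the originating system would then be done per returned job name.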

Now We Have Two Records To Analyse. Is This Better Than One?

So now we have two SMF 101 records for the job[3]:

  • The one on the job’s originating system.
  • The DDF one on the system it connects to via DDF.

As I pointed out at the end of this discussion thread in 2005 the originating job’s 101 record might contain substantial DB2 Services Wait Other time – which would be the time spent over in the system whose data it accessed.

So I would advocate a two step process:

  1. Analyse the job’s home DB2 101 to discover the big buckets of time and tune down – as usual.

  2. If the DB2 Services Wait Other time is substantial then understand the time buckets in the other 101 record (the one on the system it connects to via DDF).

Actually there is a third aspect: If your concern is actually the CPU time this job causes on the system it connects to via DDF then obviously the DDF 101 is the one you want.

So I think you can do good work with the pair of 101 records – so long as you’re collecting 101s from both DB2 subsystems and processing them appropriately.

What About The Network Traffic?

While you can’t directly see the network time you can see the traffic: The QLAC section in the 101 record gives you such things as SQL statements transmitted, rows transferred, bytes transferred etc.

I think this is useful information – and you might actually be able to do something about it.
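As a trivial worked example, per-commit rates derived from QLAC-style counters might look like this (a sketch with made-up parameter names, not real record parsing):

```python
def traffic_per_commit(sql_sent, rows, nbytes, commits):
    """Turn raw QLAC-style counters into per-commit rates."""
    if commits == 0:
        return None  # nothing to normalise by
    return {
        "sql_per_commit": sql_sent / commits,
        "rows_per_commit": rows / commits,
        "bytes_per_commit": nbytes / commits,
    }
```

A job whose rows-per-commit is enormous is an obvious candidate for moving work (or filtering) closer to the data.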

Conclusion

Part of the purpose of this post was to sensitise Performance people to the possibility that their batch might be using DDF (and indeed that some of the DDF traffic might be coming from remote z/OS batch jobs).

The other part of the purpose was to outline how you might go about analysing the performance of such batch jobs.

In my code I have a new report that covers this ground. Naturally it’ll evolve – and I expect I’ll be asking customers whose DB2 Batch I study for SMF 101 data from any DB2 subsystems they think it accesses remotely.


  1. For simplicity I’ll write as if the access is read. In reality, of course, update is quite likely. 

  2. For CICS, in contrast, the middle 4 characters are the CICS transaction name. 

  3. I’ve simplified here. In reality the job might be multi-step, so you would then get more than 2 SMF 101 records. 

DDF Counts

(Originally posted 2016-01-17.)

I don’t think I’ve ever written very much about DDF. Now seems like a good time to start.

I say this because I’ve been working pretty intensively over the last couple of weeks on upgrading our DDF Analysis code. Hence the recent DFSORT post (DFSORT Tables).

I’m actually not the DB2 specialist in the team but, I’d claim, I know more about DB2 than many people who are. At least from a Performance perspective.

Actually this post isn’t about how to tune DDF. It’s about how to categorise and account for DDF usage. As usual I’m being curious about how DDF is used.

The Story So Far

A long time ago I realised it would be possible and valuable to “flatten” parts of the SMF 101 Accounting Trace record. A DDF 101 record is cut at every Commit or Abort – in principle.

And the DDF 101 record has additional sections of real value:

  • QMDA Section (mapped by DSNDQMDA) has lots of classification information, most particularly detail on where requests are coming from.
  • QLAC Section (mapped by DSNDQLAC) documents additional numbers such as rows transmitted.

In addition a field in the standard QWAC Section (mapped by DSNDQWAC) documents the WLM Service Class the work executed in. This field (QWACWLME) is only filled in for DDF.

So I wrote a DFSORT E15 exit to reformat the record so that all the useful DDF information is in fixed positions in the record. This makes it easy to write DFSORT and ICETOOL applications against the reformatted data. (In our code this data is stored reformatted on disk, one output record per input record.)

These reports concentrated on refining our view of what applications accessed DB2 via DDF. So, for example, noticing that the vast majority of the DDF CPU was used by a JDBC application (and its identity).

I also experimented with writing a DDF trace – using the time stamps from individual 101 records. Because installations can now consolidate DDF SMF 101 records[1] (typically to 10 commits per record) this code has issues.[2]

This code was good for “tourist information” but showed a lot of promise.

2015 Showed The Need For Change

A number of common themes showed the need for change, particularly in 2015.

There were some defects. The most notable was the fact that DB2 Version 10 widened a lot of the QWAC section fields. But also my original design of converting STCK values to Millisecond values was unhelpful, particularly when summing.

But these are minor problems compared to two big themes:

  • Customers want to control DDF work better, particularly through better crafted WLM policies.
  • Customers want to understand where the DDF-originated CPU is going, with a view to managing it down.[3]

These two themes occurred in several different customer engagements in 2015, but I addressed them using custom queries.

New Developments

So now in early 2016, while waiting for an expected study to start, I’m enhancing our DDF code. With the test data (from a real customer situation) the results are looking interesting and useful.

Time Of Day Analysis

My original code always broke out the SMF record’s date and time into separate TSDATE, TSHOUR, TSMIN, etc fields. This means I can create graphs by time of day with almost arbitrary precision.

My current prototype graphs with 1-minute granularity and (less usefully) with 1-hour granularity. And there are two main kinds of graph: Class 1 CPU and Commits.
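The bucketing itself is straightforward. Here’s a Python sketch of rolling Class 1 CPU up to 1-minute buckets per Correlation ID; the field names are inventions standing in for the TSDATE / TSHOUR / TSMIN breakout:

```python
from collections import defaultdict

def cpu_by_minute(records):
    """Sum Class 1 CPU into (corrid, hour, minute) buckets."""
    buckets = defaultdict(float)
    for r in records:
        key = (r["corrid"], r["tshour"], r["tsmin"])
        buckets[key] += r["cl1_cpu"]
    return buckets
```

The same shape of rollup, with a Commit counter instead of CPU, gives the Commits graph.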

With my test data (actually to be fed back to the customer) the Commits show an interesting phenomenon: Certain DDF applications have regular bursts of work, for example on 5 minute and 15 minute cycles. Normally I don’t see RMF data with 1-minute granularity so don’t see such patterns. Bursts of work are more problematic, including for WLM, than smooth arrivals. Now at least I see it.

What follows is an hour or so’s worth of data for a single DB2 subsystem, listing the top two Correlation IDs.

Here’s the Commits picture:

As you can see there are two main application styles[4], each with its own rhythm. These patterns are themselves composite and could be broken out further with the QMDA Section information, right down to originating machine and application.

And here’s the corresponding CPU picture:

It shows the two applications behave quite differently, with the cycles largely absent. This suggests the spikes are very light in CPU terms and that other e.g. JDBC applications are far more CPU-intensive. Notice how the CPU usage peaks at 2 GCPs’ worth (120 seconds of CPU in 60 seconds). The underlying JDBC CPU usage is about 1 GCP’s worth.

zIIP

I’ve also added in three zIIP-related numbers:

  • zIIP CPU
  • zIIP-eligible CPU
  • Records with no zIIP CPU in them

zIIP CPU is exactly what it says it is: CPU time spent executing on a zIIP.

zIIP-eligible CPU has had a chequered history but now it’s OK. It’s CPU time that was eligible to be on a zIIP but ran on a General-Purpose Processor (GCP).

The third number warrants a little explanation: With the original implementation of DDF zIIP exploitation every thread was partially zIIP-eligible and partially GCP-only. More recently DB2 was changed so a thread is either entirely zIIP-eligible or entirely GCP-only. By looking at individual 101 records you can usually see this in action.

So I added a field that indicates whether the 101 record had any zIIP CPU or not – and I count these. Rolling up 101 records complicates this but my test data suggests the all-or-nothing works at the individual record level.
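Here’s a minimal sketch of that counting, again with invented field names:

```python
def ziip_counts(records):
    """Split 101 records into zIIP-touching vs GCP-only, per the
    all-or-nothing behaviour described above."""
    with_ziip = sum(1 for r in records if r["ziip_cpu"] > 0)
    return {"with_ziip": with_ziip, "gcp_only": len(records) - with_ziip}
```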

Workload Manager

Last year I did some 101 analysis to help a customer set up their WLM policy right for DDF.

Nowadays I get the WLM policy (in ISPF TLIB form) for every engagement.[5] So I can see how you’ve set up WLM for DDF. What’s interesting is how it actually plays out in practice.

And this is where the 101 data comes in:

  • QWACWLME tells me which WLM Service Class a transaction runs in (only for DDF).
  • The record has CPU data which enables you to calculate CPU Per Commit.
  • You also get Elapsed Time Per Commit.

It would be wonderful if SMF 101 had the ending Service Class Period but it doesn’t. But at least you can do statistical analysis, particularly to see whether the work is homogeneous or not.

Similar statistical analysis can help you set realistic Response Time goals.
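As an illustration of the sort of statistics involved, here’s a sketch of bucketing response times and checking what percentage would meet a candidate goal:

```python
def pct_within(resp_times, goal):
    """Percentage of transactions completing within a candidate goal value."""
    ok = sum(1 for t in resp_times if t <= goal)
    return 100.0 * ok / len(resp_times)

def bucket(resp_times, limits):
    """Count transactions into buckets bounded by ascending limits,
    with a final overflow bucket."""
    counts = [0] * (len(limits) + 1)
    for t in resp_times:
        for i, lim in enumerate(limits):
            if t <= lim:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts
```

Run against real 101 data, pct_within tells you whether, say, “95% in 0.5s” is realistic; bucket gives the distribution graphs below.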

Here are a couple of graphs I made by teaching my code how to bucket response times and Class 1 TCB times, restricting the analysis to JDBC work coming into a single DB2 subsystem: [6]

Here’s the response time distribution graph:

If this were a single-period Service Class one might suggest a goal of 95% in 0.5s – though if that were the case it’d be an unusually high value (0.5s).

You can also see attainment wasn’t that variable through this (75 minute) period.

And here’s the Class 1 TCB time distribution graph:

Notice how the bucket limits are lower than in the Response Time case. In my code I can fiddle with them separately.

Over 95% of the transactions complete using less than 15ms of CPU.

These two graphs aren’t wildly interesting in this case but they illustrate the sort of thing we’ll be able to see as analysts with the new code. I think it’s a nice advance.

In Conclusion

It’s been hard work and I’ve rewritten much of the reporting code. But I’m a long way forwards now.

It’s interesting to note the approach and, largely, the code is extensible to any transaction-like access to DB2. For example CICS/DB2 transactions. “Transaction-like” because much of the analysis requires frequent cutting of 101s. Most batch doesn’t look much like this.

I’d also encourage customers who have significant DDF work and collect 101 records to consider doing similar things to those in this post. This is indeed a rich seam to mine.

And I think I just might have enough material here for a conference presentation. Certainly a little too much for a blog post. 🙂

As always, I expect to learn from early experiences with the code. And to tweak and extend the code – probably as a result of this.

So expect more posts on DDF.


  1. By the way I count Commits and Aborts per record, and can see where DDF rollup is occurring and whether the default of 10 is in effect.  ↩

  2. But I think I have a workaround of sorts.  ↩

  3. Or at least doing Capacity Planning.  ↩

  4. Which are JDBC (Java) and (presumably) “Data Flow Engine”, whatever that is.  ↩

  5. It’s not quite that I won’t talk to you without it but pretty close. 🙂  ↩

  6. I’m making the graphs add up to 100%. I could, of course, use absolute values. And that might be useful to figure out if we even have enough transaction endings to make Response Time goals useful.  ↩

DFSORT Tables

(Originally posted 2016-01-10.)

It’s been a while since I posted a DFSORT trick – and it’s high time I did.

This post follows (distantly) on from More Maintainable DFSORT and is occasioned by some recent development work on our tools.

As so often happens, developing this code has been a bit of a journey of discovery. And I’ve learnt (the hard way) a couple more ways you can make the code more maintainable.

Let me straight away share a few of these with you – in case you use [1] DFSORT but don’t want to read much further.

  • Where possible specify fields on separate lines. So, rather than writing

    INREC FIELDS=(A,B,C)
    

    write

    INREC FIELDS=(A,
      B,
      C)
    

    In fact the above mentioned post contained this advice, but in my recent development work it’s proved invaluable.

  • Consider a padding final field on eg OUTREC:

    OUTREC FIELDS=(A,
        B,
        X)
    

    where the ‘X’ is a single blank specifier.[2] That way when you move fields around or delete them you don’t need to worry about the trailing bracket – as it’s after the invariant X.

  • You can decode a STCK value into printable seconds with

    TIMESTAMP,DIV,+4096000,EDIT=(IIIT.TTT)
    

Using IFTHEN To Make A Table

This is the “meat” of this post.

Suppose you want to produce a report that is a grid or table, with the same type of value in each cell.

Consider the following input data set:

ALPHA       WHITE  7
ALPHA       BLUE   1
BRAVO       RED    3
ALPHA       WHITE  4
ALPHA       RED    8
BRAVO       RED   11
CHARLIE     BLUE  67
BRAVO       RED   34
BRAVO       WHITE 57
CHARLIE     BLUE   8
ALPHA       WHITE 34
CHARLIE     BLUE  81
DELTA       RED   24
ECHO        BLUE   9
FOXTROT     RED    7

Three columns, of which the third is numeric though character rather than binary.

Now look at this report, produced from the data:

 Division       RED Sold WHITE Sold  BLUE Sold Other Sold   RED Txns WHITE Txns  BLUE Txns Other Txns
 ------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
 ALPHA                 8         45          1          0          1          3          1          0
 BRAVO                48         57          0          0          3          1          0          0
 CHARLIE               0          0        156          0          0          0          3          0
 DELTA                24          0          0          0          1          0          0          0
 ECHO                  0          0          9          0          0          0          1          0
 FOXTROT               7          0          0          0          1          0          0          0

The second field from each input data record is used to define the column, while the first field is used to define the row.

The input data is mapped with the following DFSORT Symbols (SYMNAMES DD) statements:

* MAPPING OF ORIGINAL RECORDS
POSITION,1
DIVISION,*,12,CH
COLOUR,*,6,CH
SOLD,*,2,CH

In picture form:

To achieve the desired results in the report I use the following DFSORT statements.

First INREC to reformat the data:

 INREC IFTHEN=(WHEN=INIT,
   BUILD=(DIVISION,
     COLOUR,
* SOLD VALUE IN INPUT RECORD CONVERTED TO BI
     SOLD,UFF,TO=BI,LENGTH=2,
* SOLD TALLIES
     X'0000',
     X'0000',
     X'0000',
     X'0000',
* RECORD TALLIES
     X'0000',
     X'0000',
     X'0000',
     X'0000')),
 IFTHEN=(WHEN=(_COLOUR,EQ,C'RED   '),
    OVERLAY=(_RED_SOLD:_SOLD,_RED_RECS:X'0001')),
 IFTHEN=(WHEN=(_COLOUR,EQ,C'WHITE '),
    OVERLAY=(_WHITE_SOLD:_SOLD,_WHITE_RECS:X'0001')),
 IFTHEN=(WHEN=(_COLOUR,EQ,C'BLUE  '),
    OVERLAY=(_BLUE_SOLD:_SOLD,_BLUE_RECS:X'0001')),
 IFTHEN=(WHEN=NONE,
    OVERLAY=(_OTHER_SOLD:_SOLD,_OTHER_RECS:X'0001'))

Second a SORT statement:

 SORT FIELDS=(_DIVISION,A)

Third a SUM statement:

 SUM FIELDS=(_RED_SOLD,
  _WHITE_SOLD,
  _BLUE_SOLD,
  _OTHER_SOLD,
  _RED_RECS,
  _WHITE_RECS,
  _BLUE_RECS,
  _OTHER_RECS)

And fourth an OUTFIL statement:

 OUTFIL FNAMES=SORTOUT,REMOVECC,
 HEADER1=('Division    ',X,
   '  RED Sold',X,
   'WHITE Sold',X,
   ' BLUE Sold',X,
   'Other Sold',X,
   '  RED Txns',X,
   'WHITE Txns',X,
   ' BLUE Txns',X,
   'Other Txns',X,/,
   '------------',X,
   '----------',X,
   '----------',X,
   '----------',X,
   '----------',X,
   '----------',X,
   '----------',X,
   '----------',X,
   '----------'),
 OUTREC=(_DIVISION,X,
   _RED_SOLD,EDIT=(IIIIIIIIIT),X,
   _WHITE_SOLD,EDIT=(IIIIIIIIIT),X,
   _BLUE_SOLD,EDIT=(IIIIIIIIIT),X,
   _OTHER_SOLD,EDIT=(IIIIIIIIIT),X,
   _RED_RECS,EDIT=(IIIIIIIIIT),X,
   _WHITE_RECS,EDIT=(IIIIIIIIIT),X,
   _BLUE_RECS,EDIT=(IIIIIIIIIT),X,
   _OTHER_RECS,EDIT=(IIIIIIIIIT),X,
   X)

The data reformatted with INREC is mapped with these Symbols (in the same file as the input data symbols):

* RESULTS OF INREC
POSITION,1
_DIVISION,*,12,CH
_COLOUR,*,6,CH
_SOLD,*,2,BI
_RED_SOLD,*,2,BI
_WHITE_SOLD,*,2,BI
_BLUE_SOLD,*,2,BI
_OTHER_SOLD,*,2,BI
_RED_RECS,*,2,BI
_WHITE_RECS,*,2,BI
_BLUE_RECS,*,2,BI
_OTHER_RECS,*,2,BI

The data remains formatted this way until the OUTFIL statement produces the final report to the REPORT DD.

Mostly this is complicated stuff so let me take you through it, statement by statement.

INREC

The INREC statement reformats the record to look like this:

Here two sets of four new fields have been added to the input record. They are mapped with (previously-shown) symbols:

_RED_SOLD,*,2,BI
_WHITE_SOLD,*,2,BI
_BLUE_SOLD,*,2,BI
_OTHER_SOLD,*,2,BI
_RED_RECS,*,2,BI
_WHITE_RECS,*,2,BI
_BLUE_RECS,*,2,BI
_OTHER_RECS,*,2,BI

Also the SOLD field has been converted to a 2-byte Binary field:

_SOLD,*,2,BI

So this is all achieved with a set of IFTHEN “stages”, looking a lot like a pipeline:

  • IFTHEN WHEN=INIT is always performed – and first. It primes the counter fields (with 2 bytes of Binary zeroes apiece) and uses SOLD,UFF,TO=BI,LENGTH=2 to convert the SOLD field to Binary.
  • IFTHEN WHEN=(_COLOUR,EQ,C'RED   ') is used only for records where the COLOUR field is 'RED   ', to copy the _SOLD Binary value into the _RED_SOLD field and to write Binary 1 (X'0001') into the _RED_RECS field.
  • Likewise the next two IFTHEN clauses, which do the same for 'WHITE ' and 'BLUE  '.
  • IFTHEN WHEN=NONE is performed only for records where none of the previous IFTHEN WHEN conditions (the WHEN=INIT clause excepted) were met. It copies the SOLD value into _OTHER_SOLD and Binary 1 into _OTHER_RECS.

After INREC the number of records is the same, but each record’s SOLD value has been copied into the right _SOLD field and Binary 1 into the matching _RECS field.

The left hand side of the data at this point looks like:

ALPHA                 0          7          0          0          0          1
ALPHA                 0          0          1          0          0          0
BRAVO                 3          0          0          0          1          0
ALPHA                 0          4          0          0          0          1
ALPHA                 8          0          0          0          1          0
BRAVO                11          0          0          0          1          0
CHARLIE               0          0         67          0          0          0
BRAVO                34          0          0          0          1          0
BRAVO                 0         57          0          0          0          1
CHARLIE               0          0          8          0          0          0
ALPHA                 0         34          0          0          0          1
CHARLIE               0          0         81          0          0          0
DELTA                24          0          0          0          1          0
ECHO                  0          0          9          0          0          0
FOXTROT               7          0          0          0          1          0

For legibility I’ve left the last few columns out and in fact what you’re seeing is formatted so you can read the Binary numbers.

At this point there’s been no summation.

SORT

I sort on the _DIVISION field, which is really the same as the DIVISION field.

SUM

I sum the 4 _SOLD and the 4 _RECS fields. To show you the result of this in a viewable form I’d pretty much be showing you the final result (and I’ve already done that).

OUTFIL OUTREC

While there are some very sophisticated uses of OUTFIL this one is a simple case of report writing:

  • HEADER1 just prints a one-time header line (or two). The ‘/’ just specifies a new line.
  • OUTREC reformats the records passed, in this case making them printable. For example _RED_SOLD,EDIT=(IIIIIIIIIT) converts _RED_SOLD to a printable number with leading zeroes suppressed.

Conclusions

The above worked example is readily adaptable. But it is a little bit fragile, as is so often the case with advanced DFSORT and ICETOOL applications. Once I got used to the basic technique – using a series of IFTHEN WHEN clauses to copy one input field into a series of different output fields depending on another field’s value – it became readily extensible and adaptable.

And some of the techniques in this post (and in More Maintainable DFSORT) made this much easier.

Some things to note:

  • You have to know how many columns you want and what (in this case) COLOUR field values to expect. In my real world example I took full advantage of the _OTHER fields to ensure I captured them all.[3]
  • This example shows you can have 2 fields “gridded” like this. In this case one is just a count but I’ve done this with two and indeed three distinct input record fields.

  • This whole technique depends heavily on DFSORT Symbols.

I appreciate this has been a long and fiddly post. Perhaps we can hope for something more succinct soon. 🙂


  1. By “use” I mean “write DFSORT / ICETOOL statements” rather than “run jobs”.  ↩

  2. With OUTFIL if you specify a header (eg HEADER1) you might want to pad the OUTREC with lots of blanks, eg 50X. But you wouldn’t want to do that with high-volume Production data.  ↩

  3. In Production I have another query which tells me what values to expect.  ↩

Tis The Season

(Originally posted 2015-12-22.)

I can cope with both “zee” and “zed”. [1]

I love the myriad ways of pronouncing “CICS”.

I can even detect such things as “Day Bay Tway” when I hear them.

But there are a couple of things that I’m slightly bemused by:

  • People saying “zee-oss” or “zed-oss” or “zoss”.
  • People calling WLM “Willem”.

I wonder if it’s a more modern version of what I’d call “Amdahl Coffee Mug Syndrome”…

When I first started at IBM customers used to goad IBMers by offering them coffee in an Amdahl mug. I don’t know how we were supposed to react but I’d just say “thank you very much” and accept the coffee. No point in being goaded.

I actually try to pronounce everything right – even at the risk of sounding like a poseur. Even competitive products. And certainly the names of people.

I’d encourage everybody to do the same.

And with that brain dump of misanthropy 🙂 into my phone on the last flight home I’ll take this year (back from Istanbul – where the people are friendly and the beer’s the beer) 🙂 …

I’ll wish everyone Happy Holidays!

Or, if you prefer…

Bah[2] Humbug! 🙂


  1. I normally get this right, knowing that some countries prefer one, some the other, and that some don’t care.  ↩

  2. The SwiftKey keyboard on the phone first rendered “Bah” as “Bahamas”. You might prefer that. 🙂  ↩

Overdoing It

(Originally posted 2015-12-22.)

WLM will give up on an unachievable goal, eventually. Recently I came across a customer who didn’t know this and for whom this was a big problem.[1]

This customer, like many others, was running heavily constrained for CPU. [2]

But it does have consequences.

In this particular case they had defined two service classes – one for their main Production IMS address spaces and one for their Production DB2 subsystem.

The goals were both with Importance 1. One had a Velocity goal of 99% and the other of 95%.

I joked they’d’ve specified 100% if they could. One of them said the panel only allowed 2 digits. 🙂

In both cases the goals were set way too high and the velocity attainment would fall well short of the goal. Much of the time the Performance Index (PI) would be so high[3] that WLM would give up on the goal.

Of course the customer had several service classes with work using IMS and DB2 and with lesser importance (2, 3, etc) and with more modest goals.

These “lesser” service classes tended to meet or even exceed their goals. But it doesn’t mean the applications performed well or stably in real world terms. You need the server to perform well for that to happen. And in this case IMS and DB2 were starved of CPU – both GCP and zIIP. They became donors rather than receivers. And priorities were effectively inverted.

So this effective inversion of priorities was damaging, particularly at higher utilisation levels.

The moral of the story is: don’t overdo it, goal-wise, and do check goal attainment, adjusting goals sensibly based on what you see. Otherwise you could be in for priority inversion and a nasty surprise.
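For reference, the arithmetic behind the story: execution velocity is Using samples as a percentage of Using plus Delay samples, and for a velocity goal the PI is goal divided by achieved (so higher is worse, and 1.0 means “just met the goal”). A quick sketch:

```python
def velocity(using_samples, delay_samples):
    """Execution velocity: Using as a percentage of Using + Delay samples."""
    return 100.0 * using_samples / (using_samples + delay_samples)

def velocity_pi(goal, achieved):
    """Performance Index for a velocity goal: goal / achieved."""
    return goal / achieved
```

So a CPU-starved server achieving a velocity of 20 against a goal of 99 has a PI of 4.95 – deep into “give up” territory.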


  1. And, coincidentally, an IBMer who didn’t know it either.  ↩

  2. That, of course, is their prerogative.  ↩

  3. The higher the PI the worse the goal attainment, with a PI of 1 meaning “just met the goal”.  ↩

WLM Policy Timestamp Analysis

(Originally posted 2015-12-19.)

After writing Reviewing The Situation I got thinking.[1]

I’ve known for a long time the WLM Policy (XML) has timestamps in it. The thought was “maybe there’s value in doing timestamp analysis”.

Here is a fragment of a real customer policy, showing a resource group definition:

It’s pretty easy to read. Obviously the XML elements whose node name start with “Creation” or “Modification” are of interest here.

So I modified my PHP code to produce the following two tables:

I’ve tested this with a couple of customers. Basically it counts which years things were created and also modified.
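My code is PHP, but the counting idea is easy to sketch in Python. The “Creation…” / “Modification…” element-name prefixes follow the fragment described above; the exact tag names and the YYYY/MM/DD date layout below are assumptions, not the real service definition schema:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def years_touched(xml_text):
    """Count creations and modifications by year, from any elements whose
    names start 'Creation' or 'Modification' and carry a date."""
    created, modified = Counter(), Counter()
    for elem in ET.fromstring(xml_text).iter():
        tag = elem.tag.split("}")[-1]  # drop any XML namespace prefix
        if tag.startswith("Creation") and "Date" in tag:
            created[elem.text[:4]] += 1
        elif tag.startswith("Modification") and "Date" in tag:
            modified[elem.text[:4]] += 1
    return created, modified
```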

In the real life example there is a gap of a couple of years – a few years ago – but otherwise the story is one of continual maintenance.

In the case of the other test customer it was interesting to hear them translate userids into names; some of the people were still working for the customer while others had retired. While this might seem like “tourist information” I do believe quite a bit of the job I do is social.

So these two customer cases aren’t huge leaps forward, but it’d be interesting to see what happens when I encounter a customer who hasn’t maintained their WLM policy recently.

There are some issues with this method:

  • The precise items created and updated aren’t reflected in this current report, nor are the precise changes, e.g. how a goal’s velocity was altered.
  • The granularity of items changed isn’t great.
  • For an item that has been modified the data only contains the created and last modified date, with no hint of any intermediate changes.

One thing I could fix is producing a more detailed report; for now I have to hunt for the timestamps in the HTML report I produce. So, for example, knowing when a bunch of classification rules were added could be interesting.

As I said at the beginning it was a thought; I’m not convinced there’s a huge amount of value in this but, as with so much new data, with evolving code and more experience of real customer situations I might change my mind.[2]


  1. Perhaps I should’ve thought first, and written second. Now, now, settle down. 🙂  ↩

  2. I’m not actually a pessimist but data always looks least useful right at the beginning.  ↩

Reviewing The Situation

(Originally posted 2015-12-14.)

I might have written about this before but it’s such a nebulous subject Web searches don’t enable me to tell. In any case it’s a subject worth reviewing every now and then.

The subject is “when to review your WLM policy”.

I’ve written extensively on how to look at a policy.

While I think you should read Analysing A WLM Policy – Part 1 I want to refer to something I wrote in Analysing A WLM Policy – Part 2.

I talked about 3 categories of WLM policy:

And I noted it was Category 3 that was the most problematic.

I’ve reviewed a fair few WLM policies since then – and I stand by what I said.

But, as is so often the case, applying a “calendar line” view is useful here.[1]

If I were to hatch a rule it would be “all WLM policies evolve to Category 3”.

No policy should remain static, and there are as many cases where the policy should have evolved but didn’t as there are of unhelpful changes.

Two examples:

  • The machine configuration changed but velocity goals weren’t re-evaluated.
  • Response time goals weren’t adjusted to meet the needs of the business.

The evolution towards Category 3 occurs over time, for example as new workloads appear. (And when they disappear clean up seldom happens.)

So, aside from explicit reasons like new hardware, or new applications, I think that someone should take a good look at a WLM policy every few years. Almost inevitably it will have deteriorated in that time.

By the way I don’t care who does the review – so long as they’re competent. While I get to see my fair share of WLM policies, it’s not my prime job. (Though it is a key topic in many customer conversations.) So it’s not my intention to sell you Services.

Talking of “competent”, one thing I like to emphasise is remaining plugged into the (evolving) folklore. For example conferences, Redbooks, Facebook, Twitter, LinkedIn, user groups like MXG-L and IBM-MAIN, blogs like this (!), etc. Wherever people discuss stuff, in fact.

That way, if it is you reviewing the situation, you stand a good chance of doing a great job. Likewise of knowing when the time has come for such a review.

My main source of information on changes to a customer’s WLM policy is what I get when they send me the WLM ISPF TLIB.[2] I get:

  • Notes – and most customers use the policy notes to log changes.
  • Lots of “created” and “modified” footprints in the sand in the policy itself, complete with the userid of whoever made the change. This leads to interesting discussions sometimes. 🙂

I’d be interested in hearing readers’ views on WLM policy maintenance.

I suspect, for instance, policy changes are often documented more fully in the installation’s Change Management system than in the notes in the policy.

I also suspect most customers are still using the WLM ISPF Application, rather than z/OSMF. I’ve no recommendation to make on this, except to note investment is most likely to be made in z/OSMF.


  1. ’Cause nothing lasts forever, even cold November rain. 🙂  ↩

  2. Generally there’s less breakage if I get a TLIB than XML, the latter usually requiring me to waste time repairing it with a text editor.  ↩

Thanks In Five Languages – ITSO 2015 Tour

(Originally posted 2015-12-10.)

I’ve been very lucky (and kept busy and challenged) these last two months.

In addition to my usual case load of customer situations I’ve had the enormous privilege of participating in the ITSO 2015 Mainframe Topics tour. I’ve presented whole-day sessions on Performance and Availability in five cities: Amsterdam, Paris, Warsaw, Vienna and Bromsgrove.[1]

The main topics have been:

  • Software Pricing and Performance Specialists’ role in it
  • z13
  • zEDC

I’ve learnt an enormous amount, and some of the questions have been really good. Several participants have opened my eyes by sharing experiences.

And I’ve met splendid people – both old friends and new.

I’ve injected some of my own experiences, where I hope it’s been useful.

So this is really a “straight out of my brain onto the page” thank you post to all who participated.

And I hope next year I get to author some slides of my own and take them on tour.


  1. Hence the “five languages”.  ↩

A Note On Velocity

(Originally posted 2015-12-07.)

Not to be confused with Notational Velocity.

A recent customer situation reminded me of how our code calculates velocity. It’s worth sharing with you.

The standard way of calculating velocity is to compute

(Using Samples)/(Using Samples + Delay Samples)

and convert to a percentage by multiplying by 100.[1]

The numbers are all recorded in SMF Type 72 Subtype 3.
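As a sketch, the calculation is just the following (in Python here, not our actual reporting code):

```python
def velocity(using_samples, delay_samples):
    """Execution velocity as WLM computes it:
    Using / (Using + Delay), expressed as a percentage.
    Idle, Unknown and Other samples play no part."""
    total = using_samples + delay_samples
    if total == 0:
        return 0.0          # no Using or Delay samples: velocity undefined
    return 100.0 * using_samples / total
```

Which samples feed into `using_samples` and `delay_samples` is, as this post goes on to show, the interesting question.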

We have two main graphs associated with Velocity for a single service class period:

  • How the velocity attained varies with the amount of CPU in the service class period.
  • What the Delay Samples and Using Samples are, by time of day, for the service class period.

You would expect the two graphs to agree – with the Using Samples as a proportion of the whole similar to the velocity data points. Indeed I hadn’t questioned that until this situation.

The surprise was that the Using Samples suggested a far higher velocity than the one we computed. In detail, the Using Samples were dominated by Using I/O.[2]

The surprise was only momentary because our reporting also tells us that in this sysplex I/O Priority Management is disabled. This is unusual in my experience and one implication is that neither Using I/O nor Delay For I/O samples are included in the velocity calculation.

So why did my velocity calculation work? It’s because we use two key fields in the SMF 72–3. They are the headline Using (R723CTOU) and Delay (R723CTOT) sample counts – which reflect how WLM itself calculates velocity. We don’t use the individual Delay and Using sample counts, e.g. Delay For CPU (R723CCDE) or Using zIIP (R723SUPU), in the velocity calculation.

A few things flow from this:

  • We could produce “With I/O Samples” and “Without I/O Samples” velocity calculations and use them to guide customers in adjusting their goals.
  • We could tally up Using and Delay samples and compare to the headline counts. This way we can see how complex things like zIIP samples play out.

But those ideas are for another day or, more likely, another year (it being December now).

But let’s look at a worked (real) example. This is summing over 1 hour for the “DB2STC”[3] service class for 1 system.

The headline sample counts in that hour are:

Category   Samples
Using         1101
Delay         1349
Idle        235912
Unknown      28571

If you calculate the velocity it’s about 45%. Also Using + Delay is about 6%, fairly typical for this kind of work, the vast majority being Idle.

Breaking down Using and Delay samples, using the explicit fields in 72–3:

Category            Samples
Using CPU               928
Using zIIP              173
Delay CPU              1200
Delay zIIP              144
Delay For Swap In         5

The above doesn’t include Using I/O and Delay For I/O but the samples included do add up to the headline numbers. I’ve also excluded any zero-value counts, including “Using zIIP on CP”.

Now here are the I/O related sample counts:

Category        Samples
Using I/O         14715
Delay For I/O       289

If these samples are added in, the resulting velocity is 91%. In fact the goal is Importance 1, Velocity 70% – so the goal would be easily met if I/O Priority Management were enabled.

But that doesn’t necessarily mean better performance: Up to a point CPU queuing would be masked by the very strong Using I/O component. But a revised goal of, say, Importance 1 Velocity 90 with I/O in might be better.
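The arithmetic above can be replayed directly. A sketch in Python, using the sample counts from the tables:

```python
# Sample counts from the worked example (one hour, one system, "DB2STC")
using_cpu, using_ziip = 928, 173
delay_cpu, delay_ziip, delay_swap = 1200, 144, 5
using_io, delay_io = 14715, 289

# These should reproduce the headline counts (I/O excluded because
# I/O Priority Management is disabled in this sysplex)
using = using_cpu + using_ziip                  # matches headline Using, 1101
delay = delay_cpu + delay_ziip + delay_swap     # matches headline Delay, 1349

v_without_io = 100 * using / (using + delay)
v_with_io = 100 * (using + using_io) / (using + using_io + delay + delay_io)

print(round(v_without_io), round(v_with_io))    # prints: 45 91
```

The gap between the two numbers is exactly the point: the Using I/O samples swamp everything else, so flipping I/O Priority Management on (or off) changes what any velocity goal means.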

Food for thought.


  1. Unknown Samples and Other Samples, while recorded by RMF, are not used in the calculation.  ↩

  2. Delay For I/O Samples were minimal.  ↩

  3. What’s in a name? It turns out this service class provably (from SMF 30, as we always do) contains the MSTR, DIST and DBM1 address spaces for the customer’s Production DB2 on this system.  ↩