SMT – Some Actual Graphs

(Originally posted 2016-11-13.)

Back in the Summer I talked about z13 Simultaneous Multithreading (SMT) in Born With A Measuring Spoon In Its Mouth. I shared that I was feeling my way forward, and discovering others were doing likewise.

Here we are a few months later and my code has come on in leaps and bounds.1

So I think it’s worth sharing some design stuff and a little discovery; I’m working on the principle that people have to embrace SMT on their own personal journey.2

So let me show you a couple of graphs. I’ve obfuscated the system names on the graphs but otherwise they are “live”.

Changeable Things Need Graphing By Time Of Day

That is, of course, stating the obvious. But here is my graph that shows how some key metrics vary by time of day:

So, for example, Maximum Capacity Factor – being estimated from live measurements – varies by time of day and workload mix. Obviously, Capacity Factor – representing current load – also varies.

Notice how Average Thread Density – the average number of active threads when any are active – peaks during the day. This is a Java-heavy workload, peaking in its use of zIIP during the day.
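
If you fancy seeing the arithmetic, here’s a toy sketch of the Average Thread Density calculation in Python. It assumes you’ve already reduced the SMF 70 data to per-sample counts of active threads for a core – the input format is my invention, not RMF’s.

```python
# Toy sketch: Average Thread Density for one (SMT-2) core.
# Input: one active-thread count per RMF sample, e.g. 0, 1 or 2.
# Assumption: the samples have already been extracted from SMF 70.

def average_thread_density(samples):
    """Average number of active threads across samples where any thread is active."""
    busy = [n for n in samples if n > 0]
    return sum(busy) / len(busy) if busy else 0.0

# Example: 8 samples; idle twice, one thread active three times,
# both threads active three times.
print(average_thread_density([0, 1, 2, 1, 0, 2, 1, 2]))  # 1.5
```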

I’m not yet certain I’m wringing all the insight out of the dynamics, but I think this graph is a good first step in that direction; My experience of this sort of thing is this graph will evolve a little – as I gain more experience.

Engine-Level Analysis Is Interesting

I’ve been meaning to create this sort of graph for a long time – and SMT provides the perfect excuse.

The x axis is processor (or thread) sequenced by Core ID.3

You’ll notice the general-purpose (CP) processors come before the zIIPs (IIP).

Generating a readable graph without too many x axis label suppressions is tough. But note that for the zIIPs each core has two CPUs (with SMT–2) whereas the CPs have one.

While – from the previous graph – the picture is dynamic, I think there is value in this shift-level graph. Doing a 3-dimensional one wouldn’t be hard but I think it would be hard to consume. (Time would be the third dimension.)

In any case there’s some interesting stuff in this graph:

  • The Parked processors (in turquoise) are interesting: No GCPs are permanently parked but several are partially parked. For the zIIPs, however, it’s a different story: 6 permanently are – 3 cores.4

  • Certain things come in pairs: LPAR Busy and Core Productivity – as they are at the core level, rather than the thread level.

  • That’s not entirely true: GCPs don’t exhibit the “paired” behaviour. But that makes sense: Only a single thread is enabled on a core.

  • For GCPs CPU Ids are even numbers; For zIIPs they’re both odd and even. The zIIP values didn’t surprise me. The GCP ones did – and I’ve seen this for two customers’ data sets now.

  • Some of the zIIP CPU Ids are up in the x’70’ onwards range. This surprised me and caused me to have to widen the CPU Id field to 5 characters.5

Today a lot of the above looks like tourist information. My golden rule with tourist information is that there’s a high probability it’ll turn out to be diagnostic rather than just interesting – some day.

Conclusion

So, I’m quite pleased with the way these graphs turned out; They do illustrate some of the SMT behaviours.

Obviously experience will condition how this reporting evolves. Watch this (or some similar) space!


  1. “That must be nice for you” y’all cry. 🙂

  2. It might also help if I come calling and throw graphs at you. 🙂

  3. I’ve chosen to print CPU Ids as hex but Core IDs as decimal.

  4. I’ve wanted to plot Parked Processors for a long time now; SMT is just an excuse.

  5. CPU Id is two bytes and the SLR query returns it as a decimal number – which necessitates 5 decimal positions.

Mainframe Performance Topics Podcast Episode 8 “Queue Me Up”

(Originally posted 2016-10-29.)

We wanted to get this episode out much sooner, but things conspired against us somewhat. Not least, someone we really wanted to interview – to kick off a whole series of topics – had technical troubles.

So we went a different way from what we intended.

And we also had a few scheduling problems. But we’re here now. I hope it was worth the wait.

And just to repeat one thing: If you come anywhere near us we’re miked up. 🙂 Seriously, we’re conducting impromptu interviews when we’re out and about. Find us or avoid us, to taste. 🙂

Below are the show notes.

The series is here.

Episode 8 is here.

Episode 8 “Queue Me Up” Show Notes

Here are the show notes for Episode 8 “Queue Me Up”. The show is called “Queue Me Up” because:

  • Marna talks about moving up to higher z/OS releases…or releases “in the queue”.

  • Martin talks about the Coupling Facility list structures…or “queues”.

We had some follow up:

Mainframe

Our “Mainframe” topic was a discussion on z/OS upgrade timing considerations.

z/OS R13 has been out of service since the end of September 2016, after five years of regular service support from GA. There are three consecutive releases of coexistence (with releases planned to come out every two years).

This “discrepancy” between five years of service and six years (three times two) of coexistence has been quite interesting and deserves some thought. Marna talks about some considerations, and it might be that the “n-2” model should be reconsidered as an “n-1” model for some customers.

Performance

Our “Performance” topic was an extension of this blog post of Martin’s: Right On Queue.

Martin talks about Coupling Facility list structures, and how they are different from lock and cache structures. He also covers some considerations and causes for how they might get filled up. (Think of the analogy of a pipe getting blocked as one case.)

Sizing is important and he uses SMF 74-4 and RMF Monitor III. A good rule of thumb is that your structure’s maximum size should be in the range of 50% to 100% larger than the current size. More than double puts you at risk of having a list structure full of control blocks and little data. You also need to monitor how much of the current size is actually in use.
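
To put numbers on that rule of thumb: a structure currently running at 20MB would warrant a maximum size somewhere between 30MB and 40MB; a maximum of, say, 98MB would be well into “more than double” territory. (Illustrative numbers, of course.)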

Topics

In our “Topics” section we discussed a travel app called Waze. It’s a crowd-sourcing app you can use to get real-time travel estimates and routes. It also alerts you about such items as accidents, debris, police cars, etc, which other users have reported. This app is particularly useful even if you have to put up with a very small amount of advertising.

Where We’ll Be

Martin is in the *shires (Buckinghamshire, Yorkshire, Wiltshire), as well as taking a short trip to Amsterdam, during the rest of the year (at the time of going to press). And…also with Marna in:

  • Guide SHARE Europe UK, November 1-2, 2016. A roving microphone might appear, so please join the conversation if you wish!

Marna is going to:

On The Blog

As well as Right On Queue, Martin posted to his blog since our last episode:

  • Automatic for the Peep-Hole – about experimenting with automation for his Apple watch to dictate and send emails – using both web-based and on-device tools. It was really a test bed for thinking about when on-device automation is best and when web-based automation is better.

  • Transaction Counts – about counting transactions with RMF.

Contacting Us

You can reach Marna on Twitter as mwalle and by email.

You can reach Martin on Twitter as martinpacker and by email.

Or you can leave a comment below.

Right On Queue

(Originally posted 2016-10-22.)

Seasoned readers will recognise the title of this post as a bad pun, rather than a mis-spelling. [1]

One emergent theme in our code for Parallel Sysplex Performance is treating individual coupling facility structures on their merits. For example, lock structures are different from cache structures.

But there is much commonality in the instrumentation. For example Maximum Size, Size and Minimum Size are common to all.

One type of structure I haven’t paid much detailed attention to is List structures. Two common examples are:

  • XCF Signalling Structures [2]
  • CICS Shared Temporary Storage queues [3]

But an incident recently led me to think about List Structure behaviour:

Two test systems with CICS regions on them were sharing a Temporary Storage Queue list structure. The structure is 20MB in size (with a Maximum Size of 98MB).[4]

The structure got completely full.

If you approach the structure as some form of queue it helps, because it lets you muse in the following ways:

  • Maybe the reader stopped reading.
  • Maybe the writer suddenly splurge wrote.
  • Maybe the writer outpaced the reader for some other reason.

The truth of it does need sorting out. All of these are feasible explanations in a testing scenario but you wouldn’t want to go into production like this.

In a queuing environment you have to think about how big a queue is required.[5]

In general a large queue (buffer) helps with transient variations in writer and reader speed; It doesn’t help much with persistent outpacing.

But what can put a “bung” in the pipe? Or appear to?

  • A dead reader can do it – whether (in this case) a CICS region, the DB2 it connects to, the LPAR or the machine. You get the picture, I’m sure: It’s not just the actual reader that matters.
  • “Market Open” – where a concerted spike in writes can remain unmatched for a while.

So we need to monitor certain list structures. In SMF 74–4 we have, among other things:

  • Maximum number of elements – R744SMAE
  • Current number of elements – R744SCUE

Plotting the latter as a % of the former is probably the right thing to do. Obviously an RMF interval of, say, 15 minutes might not catch sudden spikes.
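
The calculation itself is trivial once the records are parsed. Here’s a toy Python sketch, assuming you’ve already extracted the two fields per interval; the tuple layout is my invention.

```python
# Toy sketch: list structure element occupancy from SMF 74-4.
# Input: (timestamp, R744SMAE, R744SCUE) per RMF interval -- already parsed.

def element_occupancy(intervals, warn_pct=80.0):
    """Yield (timestamp, occupancy %, flag) -- flagging intervals worth a closer look."""
    for ts, max_elements, cur_elements in intervals:
        pct = 100.0 * cur_elements / max_elements if max_elements else 0.0
        yield ts, pct, pct >= warn_pct

# Made-up numbers: a quiet interval and a rather fuller one.
data = [("09:00", 200_000, 30_000), ("09:15", 200_000, 170_000)]
for ts, pct, warn in element_occupancy(data):
    print(f"{ts}: {pct:.1f}%{'  <-- investigate' if warn else ''}")
```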

But in the “Market Open” type of scenario it’s worthwhile trying to understand what it does to major queues. And as this post is about list structures those would include XCF signalling structures, CICS Temporary Storage queues and MQ shared message queues.

In the case I mentioned, the structure was resized to 49MB. I didn’t hang around to see what the resolution was, from the CICS point of view.

One final thought: Don’t be tempted to set the Maximum Size of a structure ludicrously big, relative to the Initial Size (or even the expected day-to-day size): I have it on good authority the structure would be full of control blocks, rather than data.


  1. An even worse pun would be “write on queue”, of course. 🙂  ↩

  2. Detectable from SMF 74–2 XCF records’ Path Data Sections.  ↩

  3. You can detect the address spaces because their program name is DFHXQMN, but you can’t detect the structures directly from SMF. Generally, however, the list structure name is mnemonic.  ↩

  4. I’ve no real idea, by the way, if this is too small. I guess that’s part of the point of this post.  ↩

  5. We’ve been here before (some of us) with BatchPipes/MVS “Pipe Depth (BUFNO)”.  ↩

Automatic For The Peep-Hole

(Originally posted 2016-10-09.)

I have to admit to being a bit of a wannabe when it comes to automation.

Certainly most of my career has been built on using and building tools – and you’d have to pry them out of my luke-warm retired hands. 🙂 But when it comes to automation in my personal life it’s a bit of a different story:

  • I haven’t (yet) got into Home Automation. Baby steps still.1
  • I don’t use many automation scripts on computers and iThingies.

Now this might surprise some people. But my modus operandi is much closer to “find a real use case” than you might think; I have to find projects that look like they’re close to a pay-off.

Anyhow, I have had a fair amount of practice trying to put workflows together, generally with decent results. Which leads me to slightly abstract musings on the subject of Automation.

In any case, I hope this post is in some small way an eye opener for you as to what you can do with the hardware and software (literally) to hand.2

Having installed Watch OS 3 on my Apple Watch3 I’ve found much to like; The usability, particularly the speed boost and the new dock, has improved to the point I want to play with it much more.

(I also paid a lot of money for a Task Manager that has a very nice Apple Watch interface – OmniFocus – but that’s another story.)

So I’m happy to input text on the Apple Watch – indeed inspired by OmniFocus4 – and there are lots of ways to do that. Given that, I thought a nice experiment would be to craft workflows where I can input text on the watch and have that sent as an email to my work email address.

Experiment: Sending An Email To Work

I tackled the exercise of dictating into the Watch and having it email me two different ways:

  • Workflow – running entirely on the iPhone and the Apple Watch.
  • Do Button by IFTTT plus IFTTT itself – which mostly use services on the web, kicked off by the Do Button app on the Watch.

One key difference between these two approaches is that Workflow is entirely device-oriented, whereas IFTTT has a heavy dependence on external services. Of course, both approaches require an external agent to actually send the email.

So let’s examine the two approaches in a little more detail.

With Workflow – Solely On iOS and Watch OS

I can rapidly kick off a workflow from the dock in the Watch. The left side below shows the first screen. You can dictate from there. The result is the screen on the right.

If I tap on “Done” the workflow continues, but there’s a twist:

I deliberately (and gratuitously) inserted a stage that gets the phone’s battery level. Obviously this can’t be run on the watch and, more importantly, can’t be run on the web. It has to be run locally and this is the key point:

Automation on the device can pick up things only the device knows about.

Setting up this workflow was very easy – being entirely on the iPhone. To make it work from the Watch I just had to select that as an option.

I will say the folks that make Workflow are very responsive and are rapidly adding to its capabilities.

You can get workflows others have built from within the app, and browse them on the web.

With IFTTT – External Automation

The IFTTT approach is a bit different. For a start you compose recipes using a Web interface, or use ones already built.

Secondly, the trigger for the recipe is a separate app – Do Button.

Thirdly, the action really takes place on the web.

One consequence of web-orientation is that it is device-neutral with an Android client being available. Or even not using a device at all. A couple of my recipes don’t use a device.

Again the action starts in the dock on the Watch.

The left side below shows the first screen of the recipe. The right side shows the dictation screen.

This time I have no ability to insert the phone’s battery level. But that’s not a real-world requirement for me.

I will say I found the recipe creation process a little more cumbersome, but not really difficult.

Again the developers are adding capabilities all the time.

Conclusion

While there’s quite a lot of automation you can do solely on an iOS device – and Workflow is not (quite) the only game in town – eventually most workflows (automation scripts, if you prefer) will need external services. Sending an email is just one of those cases.

But I would counsel people to do as much automation on the device as possible, for several reasons:

  • It’s probably easier to develop with e.g. the Workflow editor.
  • Security is probably better.
  • Speed will be better.
  • You can test – and possibly run in “Production” – even when there is no network connectivity. At least up to a point.

But the “on device” and “fetching out for external services” approaches are not mutually exclusive. For example Workflow has an IFTTT action – where a named recipe can be invoked. It’s just that making good choices as to how to automate pays dividends. And at any given time each mode – on-device and on-web – will have access to different sources of data and actions.

By the way, the screenshots were taken by:

1) Pressing the digital crown and the side button simultaneously.5 This stores the screenshot in the Photos app.

2) Using the LongScreen app to stitch the photos together.

Well, I hope I’ve encouraged some of you to play with some nice toys; Despite what I said at the beginning I have a few choice workflows that ease my life.

And I’ll leave it to you to figure out the title. 🙂 It’s a rather contrived pun.


  1. I just got an Amazon Echo as a real first step.

  2. Or indeed on your wrist.

  3. It’s a Series 0, as some people have dubbed it, or the original Apple Watch. I think I’ll skip Series 2 and await Series 3, perhaps next year.

  4. I use dictation to send new tasks to my Inbox for later classification. I’ve been known to pull into a lay-by to do this. 🙂

  5. That behaviour has to be restored on Watch OS 3 from the Watch app on the iPhone.

Transaction Counts

(Originally posted 2016-10-06.)

I’ve been musing on counting transactions for a customer recently. I’d like to share some of that thinking with you.

This post is about RMF SMF Type 72 data, rather than middleware-specific stuff. That’s because it’s

  • Generic – applicable to multiple transaction managers.
  • Much lighter weight – so every customer can collect, retain indefinitely, and process it.

I’m sure this customer is far from alone in being interested in where growth came from. Because they are a CICS / DB2 and DDF customer I’ll concentrate on that, particularly CICS.

I’ve actually had no IMS situations recently. Also TSO transaction rates are rarely significant in the customers I see, so I’ll ignore TSO.

Batch is quite significant in this customer, but it requires a completely different treatment. Perhaps I’ll write about it some other time.

When I say “growth” it is of course a combination of two factors:

  • Growth in transaction rates.
  • Changes in CPU time for each transaction.

DDF

I’m going to discuss DDF transactions only briefly; I’ve talked about them a fair amount, not least in More Fun With DDF.[1]

Perhaps more useful is this presentation of mine.[2]

But to recap what many people already know: DDF Transaction rate is recorded at the Service Class Period (also Report Class) level – in SMF 72.

This doesn’t really help you when it comes to CPU per transaction. For that – at the DB2 subsystem level – you get DDF transaction rate and Enclave CPU (plus response time).[3]
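
As a purely illustrative calculation: 900 CPU seconds of enclave time against 1,500,000 DDF transactions in an interval works out at 0.6ms of CPU per transaction. The numbers are invented; the division is the point.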

CICS

CICS is an interesting case, and one I’ll talk about for the rest of this post.

In what follows I’ll refer to the following example, which incorporates a number of typical elements.

If your CICS work is managed to WLM Region goals you don’t get transaction endings.

If transaction Service Classes are used the transaction rate is recorded.[4]

In the example transactions enter through a TOR and progress thence to an AOR. For most topologies the transaction is counted once in SMF 72 even if the transaction spans multiple regions. With SMF 110 CICS Monitor Trace enabled in both the TOR and the pair of AORs, you would see transactions ending in both places. The 110 view of transaction rate would be twice that of the 72 view.
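
To put illustrative numbers on it: 500 transactions per second in the SMF 72 view would appear as roughly 1,000 endings per second in the 110 data – 500 in the TOR plus 500 across the AORs.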

On the subject of growth, for CICS at least it’s difficult to calculate CPU per transaction: the transaction service classes aren’t the same as the region ones, and the CPU is recorded in the region service classes.

Difficult To Relate To Business Transactions

How IT transactions are wired together to form business transactions can be difficult to ascertain. In the example there are two business transactions – one in blue and one in green.

Both pass through some intermediate infrastructure, perhaps a web server. Even how non-z/OS transactions turn into z/OS ones can be difficult to ascertain. In our example:

  • Business Transaction 1 (in blue) spawns two CICS transactions – which each pass through the TOR to separate AORs and on to the one DB2.
  • Business Transaction 2 (in green) spawns a single CICS transaction – which passes through the same middleware components. Possibly it uses the same transaction IDs as Business Transaction 1.

It’s worth keeping an eye on how Applications folks wire together transactions as they can be subject to change; While CPU per CICS transaction might not change, the number of them that form a business transaction might.

The trend is towards more complex business transactions – which could mean a heady mix of more CICS transactions and heavier ones.

Difficult To Calculate CPU Per Transaction

As I alluded to when discussing DDF, the CPU per CICS transaction can’t be gained from SMF 72 as the region Service Classes have the CPU and the transaction Service Classes the transaction rate.

If, however, you had a transaction Report Class that corresponded to the region Report Class you would be able to use the data from the two to perform the calculation – CPU from the region Report Class and transaction count from the transaction Report Class.

But what do I mean by this?

If the transactions running in the regions of a specific region Report Class were assigned one of a set of Report Classes specific to that region Report Class, the correspondence could be made.

So, for instance, all the regions for the ATM application have Report Class RRCICATM. The second “R” refers to “Region” and “ATM” refers to the fact this is for the ATM application.

All the transactions that run in these regions have Report Classes like RTCICAT1, RTCICAT2, etc. When these transactions run in different regions[5] their Report Classes have to be different. Here the first “T” says “this is a transaction Report Class”. “AT1”, “AT2” etc are for the ATM application.
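
To make the arithmetic concrete, here’s a toy Python sketch of the correspondence – with made-up numbers, and with the prefix-matching rule being my assumption rather than anything WLM enforces.

```python
# Toy sketch: CPU per transaction via corresponding Report Classes.
# Assumption: SMF 72 has already been summarised into CPU seconds per
# region Report Class and transaction endings per transaction Report Class.

region_cpu = {"RRCICATM": 5400.0}                         # CPU seconds
tran_counts = {"RTCICAT1": 800_000, "RTCICAT2": 400_000}  # endings

def cpu_per_transaction(app):
    """CPU per transaction for one application, e.g. 'ATM'."""
    cpu = region_cpu["RRCIC" + app]
    # All transaction Report Classes for this application: RTCICAT1, RTCICAT2, ...
    trans = sum(v for k, v in tran_counts.items()
                if k.startswith("RTCIC" + app[:2]))
    return cpu / trans if trans else float("nan")

print(f"{cpu_per_transaction('ATM') * 1000:.2f} ms CPU per transaction")  # 4.50 ms
```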

Personally, I think this might be a little fiddly to achieve. But I offer it as a suggestion.

Time To Rework CICS Report Classes?

There are lots of reasons for examining your WLM policy periodically. What I’ve discussed in this post is just another reason to.

Some specific things I’d suggest in this area, for Report Classes, are:

  • Make good use of report classes for transactions.

    For example, breaking out Mobile.

  • Ensure report class transaction rates add up to the corresponding service class’s transaction rate.

    Unless you’re using Report Classes to aggregate Service Classes they should provide a useful breakdown of the Service Class transaction rate.

  • Consider the technique I outlined to relate transactions to regions.

A couple of notes on implementation:

  • It’s safe to introduce changes to the Report Class setup one step at a time; There is no impact on performance.
  • If you’re tracking through time (and you should) changes to the Report Class (and Service Class, for that matter) setup are likely to introduce problems when comparing “before” to “after”. [6]

In general, though, I would be trying to calculate transaction rates and CPU per transaction on a daily basis, as well as over the longer term.

“Daily” might surprise you but with SMF 72 it’s lightweight and it just might catch an application change that either introduces more IT transactions or makes them heavier.


  1. This will in turn point you to a veritable thicket of posts about DDF.  ↩

  2. I’m about to update this for UK GSE Conference (November 2016).  ↩

  3. See DB2 DDF Transaction Rates Without Tears.  ↩

  4. Also response time distributions, relative to the goal, as depicted here.  ↩

  5. Unlikely in this example. So perhaps a poorly chosen one.  ↩

  6. Those are quite bad enough anyway. One problem we encountered was trying to find comparable “Month Ends”.  ↩

Mainframe Performance Topics Podcast Episode 7 “We Were On A Break”

(Originally posted 2016-09-10.)

Getting back “in the studio” was really nice. And we never had any doubt we’d keep recording – so the title is very tongue in cheek.

Below are the show notes.

The series is here.

Episode 7 is here.

Episode 7 “We Were On A Break” Show Notes

Here are the show notes for Episode 7 “We Were On a Break”. The show is called “We Were On a Break” because:

  • It’s been a very long time since we last recorded an episode. You should read nothing into it other than our schedules; in particular, Martin’s long holiday put paid to recording for a while.

    But now we’re back…

We had one piece of follow up:

  • IBM Doc Buddy – available for iOS and Android.

    This app has been enhanced with new components (aka libraries) and has received fixes for problems that users have found. There’s a very responsive team working on this tool! This app is now better than the old LookAt tool, since reason codes can be searched.

Mainframe

Our “Mainframe” topic was a discussion on Continuous Delivery.

Marna talked about four important references to understand what the z/OS platform is doing for Continuous Delivery. (IBM is embracing Agile development for many new functions, and will be providing those functions to customers in a Continuous Delivery method.)

Takeaway: some products will be putting their new functions in the service stream, while others might be putting them in releases. Read announcements carefully to see which of your products is following which model.

Performance

Our “Performance” topic was an extension of this blog post of Martin’s: Why Do We Keep Building Bigger Machines?

We acknowledge this is quite a high level treatment but it’s a question that we’re sure has been in the back of lots of minds. We’ve ideas to take some of the subtopics and make them topics in their own right.

Topics

In our “Topics” section we discussed what we (especially Martin) are using for creating presentations these days.

Products Martin mentioned were:

These are all available in some form or other for both Mac OS and iOS. And, of course, other tools are available.

Where We’ll Be

Martin is going nowhere fast. 🙂 Seriously, his travel plans are relatively local for the next few weeks.

Marna is going to:

Interesting Customer Requirements

Here are two customer requirements we’ve taken notice of. Of course, IBM may or may not decide to do them, but they might be interesting if you’d like to vote on them.

  • “zFS Definitions of Greater Than 4 GB Not Being SMS-Managed Should be Available Under IDCAMS”, ID 92523

  • “Let IBM Knowledge Center search within a manual and simplify the use”, ID 93288 .

Requests for Enhancement (RFEs) can be found here. Most z/OS items are under Brand “Servers and Systems Software”, and Product “z/OS”. Hint: use “I want to specify the brand, product family, and product” when searching.

On The Blog

As well as Why Do We Keep Building Bigger Machines?, Martin posted to his blog since our last episode:

Contacting Us

You can reach Marna on Twitter as mwalle and by email.

You can reach Martin on Twitter as martinpacker and by email.

Or you can leave a comment below.

Why Do We Keep Building Bigger Machines?

(Originally posted 2016-09-03.)

I know of no customer who uses the full capacity of a zEC12, let alone a z13.1 So why do we make them bigger each time?

I should state this post is not in support of any product announcement; It’s just scratching an itch of mine.

I think it’s an interesting topic; I hope you agree.

What Is Bigger?

While this post isn’t exhaustive I think the main aspects are:

  • Processor Capacity
  • Memory
  • I/O Capability
  • Number Of LPARs

While I’ll touch on these, as examples, I won’t talk much about engine speed; That’d be a whole other post – if I were to write about it.

Where Are Most Customers?

This is just from my personal customer set, but most of my customers are in the range of 10 – 20 purchased processors per machine. Quite a few have sub-capacity processors.

Generally they have two or three drawers (on z13) or a similar number of books (z196 and zEC12). And most of my customers’ machines are either zEC12 or z13, with a few z196 footprints remaining.

Memory-wise, I’m seeing sub-terabyte to several-terabyte configurations, depending mainly on generations.

Customers I work with tend to have two or more machines.

Typically, customers have more than 10 LPARs on a footprint.

I don’t think any of the above is giving away any secrets. And not all customers are like this.

So Why Build Bigger Machines?

There are a number of reasons, which benefit a wide range of customers. Here are some that come to mind.

Scalability

To meaningfully achieve 141 processors (or 10TB of memory) on a single footprint requires good scalability.

I remember, just after the dawn of multiprocessor mainframes, how awful the multiprocessor ratios were. To achieve even modest levels of multiprocessing a lot had to change. And indeed it has, both in software and hardware.

To be able to scale to 141 processors successfully means good multiprocessor ratios are essential. For your 15-way to be feasible, scalability has to be good across the board, all the way up to 141.

The analogy of “the Moon Shot led to non-stick frying pans” is perhaps inappropriate, but the idea that engineering needed for top end machines yields results for smaller machines is sound.

Running Everything On One Surviving Footprint

Bad stuff thankfully happens rarely to mainframe footprints, but when it does customers need to run their high-importance workloads somewhere.

One of the scenarios wise customers plan for is running (the bulk of) two machines’ worth of work on one. Under those circumstances what is normally, for example, a 20-way might need to become a 35-way. And be effective at it.

So your operating range might need, in an emergency, to be much higher up the scale.

But it’s not just the “machine gone” scenario that has to be catered for. Indeed a subset of the drawers2 in a machine might need to be taken out of service. Then you’d still want to run on the surviving drawers. So, a more powerful physical machine is a good thing, under those circumstances.3

Unexpected Demand

While the economics of unexpected demand might not be nice, the inability to support a sudden massive increase in workload is even worse.

Most customers I know could grow their workload several times over and still be contained within the same number of footprints.

The trick is to avoid derailment factors. Perhaps “wargaming” massive growth scenarios should be seen in the same light as Disaster Recovery tests.

Two examples:

  • The use of the various capacity-on-demand capabilities.
  • Middleware scalability e.g. CICS QR TCB.

LPAR Limits

I know customers for whom the (pre-z13) limit of 60 LPARs on a footprint was a real limitation. These are mostly outsourcers.

Several use zVM but it would be nice not to have to.4

I would say a prerequisite to raising the limit to 85 (on z13) was raising the limit on the number of configurable processors way past that. In the distant past I was involved in a Critsit with very large numbers of z/OS images on a footprint.

LPAR design is, of course, critical in this. And Hiperdispatch helps.

Memory

Physically installing memory is one thing; Making it perform is quite another.

For example, we’ve several times changed the fundamentals of memory management in z/OS over the years. 5

But note the continuing evolution of the way middleware uses memory.

Also note the way memory pricing has substantially improved over the years.

Closing Thoughts

Workloads are generally growing quite rapidly, mainly through two factors:

  • Increasing business volumes
  • More being done with each datum

So what might today seem very large might seem much more modest going forward.

I’ve touched on more than just CPU because configuring systems in a balanced way is important. And you can see we pay attention to that in the following graphic.

This polar chart is for z13; it shows how growth over the generations has been across all aspects.

To be specific about CPU, the following chart shows steady growth.

(By the way these two charts were sourced from the most excellent TLLB (Technical Leadership Library).)

We’ve come a long way!


  1. I’m sure there are some fully-configured machines in the world, but I’ve yet to encounter them personally. ↩

  2. Or books if you are on a machine prior to z13. ↩

  3. As an aside, the first physically-partitionable machine I remember was the 3084-QX; It could be split into two independent 2-ways. I’m not sure if this ever had to be done to rescue one half. ↩

  4. This is not an anti-zVM statement, of course. ↩

  5. Are you still using UIC for much? If so please stop. ↩

A Record Of Sorts

(Originally posted 2016-08-27.)

When looking at a batch job1 I like to see how the data flows through the various steps.

The first step – some 23 years ago 🙂 – was to look at the Life Of A Data Set (“LOADS” for short).2

With LOADS – for VSAM and non-VSAM data sets – you can see who reads and writes the data set. You can also see the EXCP count. More on that in a bit but suffice it to say EXCP count might be enough to tell you if the data set was written or read in its entirety.

Why Record Counts Matter

Probably just out of curiosity. 🙂

Actually, really not…

I just said I can detect readers and writers and I used the words “in its entirety”. But I think it useful to go deeper. Here are two – off the top of my head – reasons to want record counts:

  • Because business volumes can show up in record counts. For example, a transaction file’s record count is the number of transactions in the life of this version of the data set.

  • Because it might explain some other count. More on this one in a minute.

Estimating Record Counts

I just used the word “estimating”. Under some circumstances we can do better than estimating, as we’ll see.

One of the reports our “Job Dossier” code produces is called “Job Data Set”. Basically a list of steps and the data sets each step accesses.3

For data sets accessed by QSAM we can estimate the number of records in the data set by examining the LRECL, the Block size and the EXCP count. But there are lots of problems with this:

  • This is only going to work for Fixed-Blocked (FB) data sets.
  • Compression complicates things. We need to fix our code to handle this – though today we print the compression ratio.
  • The assumption is the processing is sequentially start-to-finish.
  • You might do a small number of EXCPs not related to actual data transfer.
  • It’s likely the step will read or write partially-filled blocks.

Still, where applicable it’s a good start.
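
Here’s that estimate as a toy Python function, with the caveats above recorded as comments. Treat it as a sketch rather than anything definitive.

```python
# Toy sketch: record-count estimate for a Fixed-Blocked (FB) data set.
# Caveats (see the list above): FB only, no compression, assumed
# start-to-finish sequential processing, every EXCP assumed to move a
# full block, partially-filled blocks ignored.

def estimate_fb_records(excps, blksize, lrecl):
    """Rough record count from EXCPs, block size and LRECL."""
    records_per_block = blksize // lrecl
    return excps * records_per_block

# Example: 1,000 EXCPs against an LRECL=80, BLKSIZE=27920 data set.
print(estimate_fb_records(1_000, 27_920, 80))  # 349,000 records
```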

But we can do better:

DFSORT’s SMF 16 tells you the overall counts of input records and output records4 whether SMF=FULL5 or not.

So in a very simple case – a single sort invocation in a step – we can use these record counts to estimate the number of records in the SORTIN and SORTOUT data sets. And we can find the SORTIN data set represented by an SMF 14 record and the SORTOUT data set by an SMF 15 record.

Record Counts And SQL Statements

Several times in a recent batch study the SMF 101 SQL counts have borne some relation to record counts. Consider the following (very realistic) scenario:

The sort step reads a data set (SORTIN DD) and writes one (SORTOUT DD). The DB2 step reads the same data set and does something with DB2 data based on the records read.

For example, in one job step the Singleton Select count matches the input record count.

So we can glean that the selects are record-driven – just with SMF.

By the way, we match SMF 101 records with SMF 30–4 Step End records by Timestamp comparison and Correlation ID matching, which I describe in gory detail in Finding The DB2 Accounting Trace Records For an IMS Batch Job Step. Ignore the “IMS” bit if you like; The preamble is the more general bit.
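
In outline – and this is a sketch with invented field names, not my actual code – the matching looks something like this:

```python
# Toy sketch: find the SMF 101 (DB2 Accounting) records for one job step.
# Assumptions: for batch the 101 Correlation ID carries the job name, and
# the records' timestamps fall within the step's start/end window. The
# dictionary keys are invented; the gory details are in the post above.

def match_101s_to_step(step, recs_101):
    """Return the DB2 accounting records that belong to one job step."""
    return [r for r in recs_101
            if r["correlation_id"].strip() == step["jobname"]
            and step["start"] <= r["end_timestamp"] <= step["end"]]
```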

What My Code Does Today

So, the essential thing is that DFSORT keeps good account of the records written – overall. For output data sets it keeps good counts at the individual data set level (with SMF=FULL).

We map all this, of course.

My first toe in the water is very limited:

For the “single sort in a step with one input data set and one output data set” case I use the SMF 16 record counts as the data set sizes. These overwrite any EXCP / block size / LRECL estimate for FB data sets – as it’s more accurate.

The really nice thing is it gives me an accurate estimate for VB data sets, which I didn’t have before.

Possible Extensions

A number of quite feasible extensions are:

  • I could keep the output data set’s record count once I’ve got it and use it in downstream steps. If it gets rewritten then the previous estimate could be invalidated, so that’s safe.
  • It would be tricky but I could propagate backwards the input data set’s record count to previous steps that read or wrote the data set.
  • I could use the OUTFIL and Output File sections in the SMF 16 record (as we query them) to handle the “multiple output data set” case.
  • With multiple input data sets I could pro-rate the input record count across them using the Access Method calls count in the Input File sections of the SMF 16 record. (This one is dodgy but better than “I’ve no idea”.)
  • I said “single sort in a step” but there is enough timestamp instrumentation to do better than that. But where do multiple sorts in a step come from? Here are some examples:
    • DB2 Utilities – where record counts would be especially useful
    • ICETOOL
    • DFSORT JOINKEYS
    • Programs that happen to invoke DFSORT multiple times
  • I don’t flag whether a record count is exact – from DFSORT – or estimated. The latter could be printed in italics.

This is quite a long list of potential extensions – but each one is fiddly. Some will get done; Some possibly won’t.

All I know is our code’s ability to estimate record counts took a leap forward, and that is proving useful straightaway. And writing this has helped me sort my thoughts out, as has explaining it to a couple of friends (with a stake in this). And I haven’t even begun to talk about VSAM yet… 🙂


  1. Or indeed a whole suite of jobs. ↩

  2. Last mentioned in DFSORT JOINKEYS Instrumentation – A Practical Example, a post I need to write a follow on to. There is good news to share. ↩

  3. There’s much more in it but this will do for now. ↩

  4. As well as Inserts and Deletes. ↩

  5. I much prefer SMF=FULL as it gives you really nice stuff like individual input and output data set information. ↩

Fearful Symmetry

(Originally posted 2016-08-21.)

The title of this post is a Physics reference but this is not about Physics.1

A customer asked me the question “why am I not getting balanced CPU Utilisation between the various machines?” I’m responding without data at this stage so I’m going to be even more “hand wavy” than usual – both in the long call I had with them and in this post.

So, let’s take it in stages…

Why Would You Want Balance?

I think it’s important to put this in context: You’re probably never going to achieve perfect balance, so the real world can’t be an automatic fail.

However, there are real world outcomes from imbalance. In the following diagram the impact – however you measure it – is much greater at higher load.

And you might measure it in terms of things like:

  • CPU per transaction
  • Transaction response time – the example given in the graph
  • Batch runtime
  • Virtual Storage occupancy

So there can be an impact and that should help you judge what is trivial imbalance and what is substantial.

Consider the following two cases:

Obviously in the former case the imbalance – taken as a whole – is not as severe as in the latter case. Momentarily, however, it could be significant.2

There are other considerations:

For example, suppose you have a System Design Point of, say, 90% – where no system should exceed that level of utilisation. Then significant imbalance (or skew) would force the other systems to run at a lower maximum utilisation, so upgrades might have to happen sooner.
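
A made-up illustration: with two notionally identical systems and a 60:40 skew, the hotter system hits the 90% design point when the pair as a whole is only at 75% – so the upgrade conversation starts much earlier than the average utilisation would suggest.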

Where Does Imbalance Come From?

I would divide the causes into two:

  • Long-term structural asymmetry
  • Short-term routing decisions

Structural Asymmetry

When I look at customers’ mainframe estates I often see symmetric (at a high level) configurations. For example, the “twin machine” architectural pattern is commonplace.

If I dig a little deeper I might see sysplexes spread across these two machines, but additional LPARs on either side that break symmetry.

I might also see the two machines aren’t identical, hardware-wise. For example, one might be a z13 and the other still a zEC12.

Even if the machines are similar enough, their connectivity might not be. For example:

  • The primary disk controller might be in the same machine room as one machine, but distant from the other (because the latter is in a different machine room).
  • Connectivity to an external coupling facility might be asymmetrical.

Take the case where a sysplex comprises four3 members, two to a machine. I’ve seen cases where these four members aren’t running quite the same workload, in architectural terms. Some examples I’ve seen:

  • CICS regions might appear on two members with no analogues on the other two
  • Distributed (DDF) DB2 work comes into two members of the sysplex but not the other two.
  • Likewise asymmetric MQ connections.

Routing Decisions

Work gets routed on a continual basis. I think we can divide this neatly into two:

  • Big globs such as Batch
  • Smaller pieces of work, such as CICS, IMS and DDF transactions

In principle, big globs ought to be harder to balance than transactions, as should work with affinities. In practice I’ve found this indeed to be so, as I’ve had quite a few questions about Batch imbalance.

There are two primary workload distribution systems:

  • Round robin, like a card dealer
  • Goal oriented, where quality of service influences placement

The former tends to even out the transaction rate, whether or not work is thereby routed to the optimal place or ends up CPU-wise balanced. But, statistically speaking, the chances of CPU balance are pretty reasonable.

The latter also has the potential for imbalance, because a better-performing server could well receive the bulk of the work. This imbalance could very well be OK as the aim is to run work well.

Imbalance in the “goal-oriented routing” case is especially a concern with a mixture of faster and slower systems, but this is really a case of Structural Asymmetry, as previously discussed.

How Can I Look At The Data?

The standard “problem-decomposition” approach applies but it’s worth rehearsing it:

  • Machine- and LPAR-level configuration and CPU Utilisation from RMF SMF 70
  • I/O Subsystem and Sysplex with various subtypes of SMF 74
  • Workload-level with RMF SMF 72
  • Address Space-level with SMF 30 Interval records
  • Transaction level with SMF 101 (DB2), 110 (CICS), 116 (MQ), 120 (WAS)

All the above is pretty standard and I hope you can see how each of these sets of instrumentation can detect imbalance – whether transient or structural.

Conclusion

So all the above was “talking cure” thinking it through; I suspect actually seeing data would add a whole extra layer of insight and experience.


  1. And no I didn’t know the Blake origin (according to this). ↩

  2. And with something like “Sloshing” – which generally isn’t detectable at the RMF (e.g. 15 minute) interval level – it could be much greater still. ↩

  3. In this regard maybe George Orwell was right (in Animal Farm) with “Four legs good, two legs bad!” but probably not: Four of anything should provide better resilience than two. But balancing across two might well be easier. ↩

Corroboration Not Correlation

(Originally posted 2016-08-14.)

This is a post where I have, yet again, to be careful to obfuscate the customer’s situation; I’ve no wish to embarrass them. So you’ll forgive me if there are no numbers. But there is a lesson worth sharing here. So I’m going for it…

It’s about DB2 and Workload Manager.1

I was recently asked to explain why an application’s DB2 Accounting Trace was showing so much Not Accounted For Time2 (NAT). Willie Favero discussed this here, essentially pointing to this IBM Technote.

There are a few things I’d pull out from this document:

  1. It’s part of DB2 Class 2 time – so when DB2 is supposed to be in control.

  2. The main causes are CPU Queuing and Paging. But there are a lot of others.

  3. It talks about NAT usually being small but I’d have an open mind about that. My experience is it is often quite large.

Point 2 is worth exploring in this case:

The umpteen others are generally not the cause of NAT, so I tend to advise customers to concentrate on CPU Queuing and Paging as potential causes.

So, while discussing this with the customer, the following occurred to me:

Let’s look at this from a WLM point of view

Before we go too far with this, it’s important to understand where DB2 work gets classified in WLM terms.

While there is some work that gets classified as DB2 – the subsystem address spaces in their Service Classes – the vast majority of DB2 work runs with the Service Class (and Dispatching Priority) the original work was classified with. For example:

  • CICS transactions with the CICS goal for their region (or one derived from the Transaction).
  • DDF work classified via its own rules – into Enclaves in the DB2 DIST address space but still not with DIST’s Service Class / Dispatching Priority.3

So, the point of this post is to make the linkage between WLM Goal Attainment and DB2 NAT.

To keep this simple – and the actual customer case looks like this – let’s assume we’re talking about a CICS application with regions classified with Region goals, going against a DB2 subsystem.

Region goals are Velocity goals, which makes the following make sense…

Suppose the Velocity goal is Importance 2, Velocity 60%.4

Given velocity attainment is

Velocity % = 100 × Using Samples / (Using Samples + Delay Samples)

you could have quite a lot of Delay For CPU samples and still make the goal – so long as there were no other Delay samples, such as Delay For I/O.

And, you probably guessed this part, this level of Delay For CPU is going to appear as some level of NAT.
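
To put numbers on it: 300 Using samples and 200 Delay For CPU samples (and nothing else) gives 300 / (300 + 200) = 60% – exactly on goal, even though 40% of all samples were waiting for CPU. And it’s that waiting that can surface in DB2 as NAT.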

Corroboration Not Correlation

At this point I flatter myself to think you’ve been wondering where the title comes from. 🙂

So let’s get to it…

I don’t think you can take the WLM view (from RMF Workload Activity Report / Data) and use the numbers therein to derive Not Accounted For Time (NAT). So you won’t get Correlation.

But I think you will get Corroboration: A large amount of WLM Delay For CPU will probably happen at the same time as a large amount of NAT.

And that’s really all that’s needed.

To finish this off, let’s look at some wrinkles:

  • There are other Delay sample types, such as Delay For I/O, that aren’t related to NAT. (Paging, however, is related to it.)
  • It might be difficult to summarize DB2 Accounting Trace over any given WLM Service Class. Note: Apart from DDF the 101 record doesn’t contain the WLM Service Class.
  • Delay For CPU might hit other things, such as non-DB2 CICS transaction processing.
  • Likewise the non-DB2 portion of a DB2 / CICS transaction, where it would show up in Class 1 minus Class 2 time.

So, this was an interesting question to be dealing with but it’s not entirely “clean”. The upshot, however, is that if you see lots of Not Accounted For Time in DB2 Accounting Trace it’s worthwhile looking at the WLM (or even System) perspective.

And we’re definitely in the Corroboration not Correlation space, and certainly not Causation.


  1. Which is, of course, a perennial topic.

  2. Also Known As “Unaccounted For Time” or, in one of our reports, “Other Wait”. I think I’ve discussed some of this before.

  3. You’ll notice I’ve used Dispatching Priority (DP) twice now. That’s deliberate as z/OS still uses DP to manage access to CPU; It’s just the externals are through WLM in support of its goals, rather than IPS.

  4. Without getting into how you should set up WLM let me just say this is not unreasonable.