CF LPARs In The GCP Pool?

This post is about whether to run coupling facility (CF) LPARs using general purpose processors (GCPs).

It might be in the category of “you can but do you think you should?”

I’d like to tackle this from three angles:

  • Performance
  • Resilience
  • Economics

And the reason I’m writing about this now is the usual one: Recent customer experiences and questions.


Performance

What kind of performance do you need? This is a serious question; “Quam celerrime” is not necessarily the answer. 🙂

A recent customer engagement saw CF structure service times of around 1500μs, observed from SMF 74 subtype 4 data. This is alarming because I’m used to seeing them in the range 3-100μs – though that does, of course, depend on a number of factors.

You might be surprised to know I thought this was OK. The reason is that the request rate is extraordinarily low.

So what’s wrong with the inherent performance of CF LPARs in the GCP Pool? Nothing – so long as they are using dedicated engines, or at any rate not getting delayed while other LPARs are using the GCP engines.

So just like running on shared CF engines, really.

But see Economics below.

The customer with the 1500μs service time was running a CF LPAR in the GCP pool, capped and with a very low weight. So, basically, starved of CPU.

It turns out that the request rate being very low means the applications aren’t seeing delays and there is no discernible effect on the coupled z/OS systems. And that’s what really counts.


Resilience

By the way, the word is “resilience”. (Even the phone I’m writing this on says so.) I’m having to get used to people saying “resiliency” (which my phone is accepting of but not volunteering).

Having the CF LPARs in the same machine as the coupled z/OS LPAR has resilience considerations. Lose the machine and you have recovery issues.

Note the previous paragraph doesn’t mention processor pools. That was deliberate; it doesn’t matter which pool the CF LPAR processors are in, from the resilience point of view.


Economics

The significant question is one of economics: GCP pool engines, of course, cost more than ICF engines.

For modern pricing schemes I don’t think CF LPARs in the GCP pool cost anything in terms of IBM software. But there might well be software pricing schemes where they do.

And then there’s the question of maintenance.

All in all, the economics of placing a CF LPAR in the GCP pool will depend on spare capacity.


Conclusion

Yes you can, and just maybe you should. But only if the performance and economic characteristics are good enough. They might well be.

And you’ll see I’ve deliberately couched this in terms of “this is little different from any Shared CF engine situation”. The main difference being the economics.

One final point: if you need to have a CF LPAR and you’re on sufficiently old hardware you might have little choice but to squeeze it into a GCP pool. But do it with your eyes open. Unless you’d like to consider the benefits of a z16 A02 or z15 T02 – eg for production Sysplex designs.

Like You Know?

Three recent events led to this blog post:

  • I was driving a while back and listening to podcasts. Two in particular were by highly experienced podcasters who are very articulate.

  • Meanwhile I’d just completed editing a podcast episode of my own. (This was Episode 32, which you might have listened to already.)

  • And a few weeks before that I was involved in a debate on an online forum about editing podcasts.

The common thread is humanity in the finished product. Versus professionalism, I suppose.

The online debate saw me advocating leaving some “humanity” in recordings, whereas others wanted “clean” recordings.

The podcasts I listened to in the car had “um”, “er”, “you know”, “like” aplenty. Because of my own editing and the online debate I was listening out for this. These “verbal tics” did not detract at all. Indeed the speakers sounded informal and human.

Now, it has to be said I’ve met all these podcasters – and would expect to be able to have good, friendly conversations when we meet again.

To bring this to Marna’s and my podcast, we are striking a particular pose. A genuine one but a conscious one: While we both work for IBM neither of us is making formal statements on IBM’s behalf. And, while we might have the tacit encouragement of our management, we’re not directed by them. In short, we don’t consider our podcast a formal production but just two friends having fun making a contribution.

If we were scripted and professionally produced we’d sound a lot different. And I think our episode structure would be different. And so would the content.

Having said that, I have two principal aims when editing:

  • Reduce the incidence of “um” etc to a listenable level.
  • Faithfully reproduce the conversation.

To that end, I generally don’t move stuff around or edit for content. I also try to do the easy edits and, beyond that, leave a few verbal tics in.

So you won’t get clean recordings from us. But you’ll get what we’re thinking, with some humanity left in.

And I’d say this song wouldn’t work without “you know”. 🙂

Mainframe Performance Topics Podcast Episode 32 “Scott Free”

Episode 32 was, as always, a slow train coming. I think it’s a fun one – as well as being informative.

It was really good to have Scott back, and we recorded in the Poughkeepsie studio, just after SHARE Atlanta, March 2023.

Talking of which, the aftershow relates to SHARE. It’s a classic example of “today’s crisis is tomorrow’s war story”.

Anyhow, we hope you enjoy the show. We enjoyed making it.

And you can get it from here.

Episode 32 “Scott Free” Long Show Notes

Our guest for two topics was Scott Ballentine of z/OS Development, a veritable “repeat offender”. Hence the “Scott Free” title.

What’s New

Preview of z/OS 3.1

Mainframe – SMFLIMxx Parmlib Member

We discussed SMFLIMxx with Scott.

SMFLIMxx is a parmlib member that acts as an IEFUSI exit replacement. Its function is related to storage and specifying limits. Functions are delivered through continuous delivery. Two examples are:

  • The SAF check
  • Number of shared pages used

In z/OS 3.1 customers will be able to specify Dedicated Real Memory Pools to assign memory to a specific application, like zCX. You will be able to use all frame sizes – 4KB, 1MB, and 2GB.

Performance – Open Data Sets, Part 1

This, the first of two topics on open data sets, was also with Scott. He’s very much the Development expert on this.

The main use for having very many open data sets (think “100s of thousands”) is for middleware, most notably Db2.

Most of the constraint relief in this area is moving control blocks above the 2GB bar. In ALLOCxx you have to code SWBSTORAGE with a value of ATB to put SWB (Scheduler Work Block) control blocks above the bar. Applications need to exploit the service – or it has no effect.
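For illustration, the ALLOCxx specification looks something like this (this is from memory – do check the z/OS MVS Initialization and Tuning Reference for the exact syntax):

```
SYSTEM SWBSTORAGE(ATB)
```

The default, as I recall, is SWBSTORAGE(SWA) – i.e. below the bar.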

Monitoring virtual storage remains important: factors such as the number of volumes for a data set affect how much memory is needed, so virtual storage estimation is difficult to do.

You can probably guess what Part 2 will be about.

Topics – Evolution of a Graph

Martin explored the large improvements he’s made with his custom graphing programs. (He already posted about what he sees with one of them here.) But this topic wasn’t about the technical subject of the graph, more the evolution from something “meh” to something much better. The evolution process was:

  • Start with a query that naively graphs database table rows and columns, with labels generated by the database manager and fixed graph titles. Not very nice, not succinct, not very informative.
  • Generate the titles with REXX, giving them more flexibility and allowing additional information to be injected.
  • Use REXX to drive GDDM directly – which enabled a lot of things:
    • REXX was able to generate many more data points and to plot them directly. (In particular the code is able to show what happens at very low traffic rates whereas previously it had had to be restricted to the higher traffic rates.)
    • REXX could generate the series names, making them friendlier and more informative.

The purpose of including this item, apart from it being a fun one, is that Martin encourages everybody to evolve their graphs – to tell the story better, to run more efficiently, and to deal with underlying technological change. Don’t put up with the graphing you’ve always had!

Customer Requirement

ZOSI-2195 “Extended GDG causes space issues”, has been satisfied: IGGCATxx’s GDGLIMITMAX with OA62222 on V2.3.

On The Blog

Marna’s NEW blog location is here. There are three new posts:

Martin has quite a few new blog posts here:

So It Goes.

A Very Interesting Graph

They say beauty is in the eye of the beholder. But I hope you’ll agree this is a pretty interesting graph.

It is, in fact, highly evolved – but that evolution is a story for another time and place. I want to talk about what it’s showing me – in the hope your performance kitbag could find room for it. And I don’t want to show you the starting point which so underwhelmed me. 😀

I’m forever searching for better ways to tell the story to the customer – which is why I evolve my reporting. This one is quite succinct. It neatly combines a few things:

  • The effect of load story.
  • The distance story.
  • A little bit of the LPAR design story.
  • The how different coupling facility structure types behave story.

Oh, I didn’t say it was about coupling facility, did I?

I suppose I’d better show you the graph. So here it is:

You can complain about the aesthetics of my graphs. But this is unashamedly REXX driving GDDM Presentation Graphics Facility (PGF). I’m more interested in automatically getting from (SMF) data to pictures that tell the story. (And I emphasised “automatically” because I try to minimise manual picture creation fiddliness. “Picture” because it could be a diagram as much as a graph.)

So let’s move on to what the graph is illustrating.

This is for an XCF (list) structure – where the requests are issued Async and so must stay Async.

Graphing Notes:

  1. Each data series is from a different system / LPAR in the Sysplex.
  2. This is the behaviour across a number of days for these systems making requests to a single coupling facility structure.
  3. Each data point is an RMF interval.

Service Times Might Vary By Load

By “load” I mean “request rate”.

I would be worried if service times increased with request rate. That would indicate a scalability problem. While I can’t predict what would happen if the request rate from a system greatly exceeded the maximum here (about 30,000 a second for PRDA) I am relieved that the service time stays at about 20 microseconds.

Scalability problems could be resolved by, for example, dealing with a path or link issue, or adding coupling facility capacity. Both of these example problem types are diagnosable from RMF SMF 74-4 (which is what this graph is built from).

Distance Matters

You’ll notice the service times split into two main groups:

  • At around 20μs
  • At around 50μs

The former is for systems connected to the coupling facility with 150m links. The latter is for connections of about 1.4km (just under a mile). The difference in signalling latency is about (1.4 – 0.15) * 10 = 12.5μs. (While I might calculate that the difference in service time is around 2.5 round trips I wouldn’t hang anything on that. Interesting, though.)
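The arithmetic is simple enough to sketch (Python, purely for illustration; the 10μs-per-km figure is the rule of thumb used above):

```python
def latency_difference_us(far_km: float, near_km: float, us_per_km: float = 10.0) -> float:
    # Rule of thumb: roughly 10 microseconds of signalling latency
    # per kilometre of cable distance.
    return (far_km - near_km) * us_per_km

# Difference between the 1.4km and 150m link groups
print(round(latency_difference_us(1.4, 0.15), 3))  # 12.5
```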

It should be noted, and I think I’ve said this many times, that you get signalling latency for each physical link. A diversity in latencies across the links between an LPAR / machine and a coupling facility tends to suggest multiple routes between the two. That would be a good thing from a resilience point of view. I should also note that this is as the infinibird 😀 flies, and not as the crow does. So cables aren’t straight and such measurements represent a (quite coarse) upper bound on the physical distance.

Coupling Technology Matters

(Necessitated by the distance, the technology between the 150m and 1.4km cases is different.)

I’ve taught the code to embed the link technology in the legend entries for each system / series.

You wouldn’t expect CE-LR to perform as well as ICA-SR; well-chosen, they are for different distances. Similarly, ICA-SR links are very good but aren’t the same as IC links.

LPAR Design Matters

LPAR design might be “just the way it is” but it certainly has an impact on service times.

Consider the two systems I’ve renamed to TSTA and TSTB. They show fairly low request rates and, I’d argue, more erratic service times.

The cliché has it that “the clue is in the name”. I’ve not falsified things by anonymising the names; they really are test systems. What they’re doing in the same sysplex as Production I don’t know – but I intend to ask some day.

The point, though, is that they have considerably lower weights and less access to CPU.

Let me explain:

When a request completes the completion needs to be signalled to the requesting z/OS LPAR. This requires a logical processor to be dispatched on a physical one – which might not be timely if the logical processor has to wait a while to be dispatched.

What’s good, though, is that the PRD* LPARs don’t exhibit the same behaviour; their latency in being dispatched and being notified the request has completed is good.

Different Structures Perform Differently

I’ve seen many installations in my time. So I know enough to say that, for example, a lock structure oughtn’t to behave like the one in the graph. Lock structure requests tend to be much shorter than those to cache, list, or serialised list structures.

What I’m gradually learning is that how structures are used matters. You wouldn’t expect, for instance, a VSAM LSR cache structure to behave and perform the same as a Db2 group buffer pool (GBP) cache structure.

I say “gradually learning” which, no doubt, means I’ll have more to say on this later. Still, the “how they’re used” point is a good one to make.

Another point in this category is that not all requests are the same, even to the same structure. For example, I wouldn’t expect a GBP castout request to have the same service time as a GBP page retrieval. While we might see some information (whether from RMF 74-4 or Db2 Statistics Trace) about this I don’t think the whole story can be told.


This example doesn’t show Internal Coupling (IC) links. It also doesn’t show different coupling facility engine speeds. So it’s not the most general story.

  • The former (IC links) does show up in other sets of data I have. For example a LOCK1 structure at about 4μs for IC links and about 5μs for ICA-SR links.
  • To show different coupling facilities for the same structure name sort of makes sense – but not much for this graph. (That would be the duplexing case, of course.)

Let me return to the “how a structure of a given type is used affects its performance” point. I think there’s mileage in this, as well as the other things I’ve shown you in this post. That says to me a brand new Parallel Sysplex Performance Topics presentation is worth writing.

But, I hope you’ll agree, the graph I’ve shown you is a microcosm of how to think about coupling facility structure performance. So I hope you like it and consider how to recreate it for your own installation. (IBMers can “stop me and buy one”.) 😀

By the way, I wrote this post on a plane on my way to SHARE in Atlanta, March 4, 2023. So you could say it was in honour of SHARE. At least a 9.5 hour plane ride gave me the time to think about it enough to write the post. Such time is precious.

When Good Service Definitions Turn Bad

I was about to comment it’s been a while since I wrote about WLM but, in researching for this post, I discover it isn’t. The last post was WLM-Managed Initiators And zIIP.

I seem to be telling a lot of customers their WLM service definition really needs some maintenance. In fact it’s every customer I’ve worked with over the past few years. You might say “well, it’s your job to analyse WLM for customers”. To some extent that’s true. However, my motivation is customer technical health rather than meeting some Services quota. (I don’t have one, thankfully.) So, if I say it I mean it.

I thought I’d explore why WLM service definition maintenance is important. Also when to do it.

Why Review Your Service Definition?

Service definitions have two main components:

  • Classification rules
  • Goals

Both yield reasons for review.

Classification Rules

I often see work misclassified. Examples include

  • Work in SYSSTC that shouldn’t be – such as Db2 Engine address spaces.
  • Work that should be Importance 1 but isn’t – again Db2 Engine but also MQ Queue Managers.
  • Work that’s new and not adequately classified. Recall, as an example, if you don’t set a default for started tasks the default is to classify them to SYSSTC. For batch that would be SYSOTHER.

So, classification rules are worth examining occasionally.


Goals

Goal values can become out of date for a number of reasons. Common ones are:

  • Transactions have become faster
    • Due to tuning
    • Due to technology improvements
  • Business requirements have changed. For example, orchestrating simple transactions into ever more complex flows.
  • Talking of business requirements, it might become possible to do better than your Service Level Agreement says (or Service Level Expectation, if that’s what you really have).
  • A change from using I/O Priority Management to not.

Goal types can also become inappropriate. A good example of this is recent changes to how Db2 DDF High Performance DBATs are reported to WLM, as I articulate in my 2022 “Db2 and WLM” presentation.

But why would goal values matter? Take the case where you do far better than your goal. I often see this and I explain that if performance deteriorated to the point where the work just met goal people would probably complain. They get used to what you deliver, even if it’s far better than the SLA. And with a goal that is too lax this is almost bound to happen some time.

Conversely, an unrealistically tight goal isn’t helpful; WLM will give up (at least temporarily) on a thoroughly unachievable goal. Again misery.

So, goal values and even types are worth examining occasionally – to make sure they’re appropriate and achievable but not too easy.

Period Durations

When I examine multi-period work (typically DDF or Batch) I often see almost all work ending in Period 1. I would hope to see something more than 90% ending in Period 1, but often it’s well above 99%. This implies, rhetorically, there is no heavy stuff. But I think I would want to protect against heavy stuff – so a higher-than-99%-in-Period-1 scenario is not ideal.
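As a sketch of the kind of check I mean (Python; the transaction counts are invented):

```python
def pct_ending_in_period_1(ended_per_period: list) -> float:
    # ended_per_period[0] is the count of transactions ending in Period 1,
    # ended_per_period[1] the count ending in Period 2, and so on.
    return 100.0 * ended_per_period[0] / sum(ended_per_period)

# An invented three-period service class: almost everything ends in Period 1
print(round(pct_ending_in_period_1([99500, 400, 100]), 1))  # 99.5
```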

So, occasionally check on period durations for multi-period work. Similarly, think about whether the service class should be multi-period. (CICS transaction service classes, by the way, can’t be multi-period.)

When Should You Review Your Service Definition?

It’s worth some kind of review every year or so, performance data in hand. It’s also worth it whenever a significant technology change happens. It might be when you tune Db2’s buffer pools, or maybe when you get faster disks or upgrade your processor. All of these can change what’s achievable.

In the aftermath of a crisis is another good time. If you establish your service definition didn’t adequately protect what it should have then fixing that could well prevent a future crisis. Or at least ameliorate it. (I’m biased towards crises, by the way, as that’s what often gets me involved – usually in the aftermath and only occasionally while it’s happening.)

And Finally

Since I started writing this post I’ve “desk checked” a customer’s WLM Service Definition. (I’ve used my Open Source sd2html code to examine the XML unloaded from the WLM ISPF Application.)

I didn’t expect to – and usually I’d also have RMF SMF.

I won’t tell you what I told the customer but I will say there were quite a few things I could share (and that I had to write new function in sd2html to do so).

One day I will get the SMF and I’ll be able to do things like check goal appropriateness and period durations.

But, to repeat what I started with, every WLM Service Definition needs periodic maintenance. Yours probably does right now.

And, as a parting shot, here’s a graph I generated from a table sd2html produces:

It shows year-level statistics for modifications to a different customer’s WLM Service Definition. As you can see, activity comes in waves. Practically, that’s probably true for most customers. So when’s the next wave due?

Heading Back Into Db2 – Architecture Part 1

I loftily talk about “architecture” a lot. What I’m really getting at is gleaning an understanding of an installation’s components – hardware and software – and some appreciation of what they’re for, as well as how they behave.

When I started doing Performance and Capacity – many years ago – I was less sensitive to the uses to which the machines were put. In fact, I’d argue “mainstream” Performance and Capacity doesn’t really encourage much understanding of what I call architecture.

To be fair, the techniques for gleaning architectural insight haven’t been well developed. Much more has been written and spoken about how to tune things.

Don’t get me wrong, I love tuning things. But my origin story is about something else: Perhaps programming, certainly tinkering. Doing stuff with SMF satisfies the former (though I have other projects to scratch that itch). Tinkering, though, takes me closer to use cases.

Why Db2?

What’s this got to do with Db2?

First, I should say I’ve been pretending to know Db2 for almost 30 years. 🙂 I used to tune Db2 – but then we got team mates who actually did tune Db2. And I never lost my affinity for Db2, but I got out of practice. And the tools I was using got rusty, some of them not working at all now.

I’m heading back into Db2 because I know there is an interesting story to tell from an architectural point of view. Essentially, one could morph tuning into asking a simple question: “What is this Db2 infrastructure for and how well suited is the configuration to that purpose?” That question allows us to see the components, their interrelationships, their performance characteristics, and aspects of resilience.

So let me give you two chunks of thinking, and I’ll try to give you a little motivation for each:

  • Buffer pools
  • IDAA

I am, of course, talking mainly about Db2 SMF. I have in the past also taken DSNZPARM and Db2 Catalog from customers. I expect to do so again. (On the DSNZPARM question, Db2 Statistics Trace actually is a better counterpart – so that’s one I probably won’t bother asking customers to send.)

I’m experimenting with the structure in my two examples. For each I think two subsections are helpful:

  • Motivation
  • Technique Outline

If this structure is useful future posts might retain it.

Buffer Pools

Here we’re talking about both local pools (specific to each Db2 subsystem) and group buffer pools (shared by the whole Datasharing group, but maybe differentially accessed).


Motivation

Some Datasharing environments comprise Db2 subsystems that look identical. If you see one of these you hope the work processed by each Db2 member (subsystem) in the group is meant to be the same. The idea here is that the subsystems together provide resilience for the workload. If the Db2 subsystems don’t look identical you hope it’s because they’re processing different kinds of work (despite sharing the data).

I think that distinction is useful for architectural discussions.

More rarely, whole Datasharing groups might be expected to resemble each other. For example, if a parallel sysplex is a backup for another (or else shares a partitioned portion of the workload). Again, a useful architectural fact to find (or not find).

Technique Outline

Db2 Statistics Trace IFCID 202 data gives a lot of useful information about individual buffer pools – at the subsystem level. In particular the QBST section gives:

  • Buffer pool sizes
  • Buffer pool thresholds – whether read or write or for parallelism
  • Page frame sizes

At the moment I’m creating CSV files for each of these, and trialling them with each customer I work with. I’m finding cases where different members are set up differently – often radically. And also some where cloning is evident. From SMF I don’t think I’m going to see what the partitioning scheme is across clones – though some skew in terms of traffic might help tell the story.
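The CSV-creation step is nothing special – a sketch, with invented member names and made-up columns (the real values come from the QBST fields above):

```python
import csv

# Invented example rows: one per (member, buffer pool), as might be
# summarised from IFCID 202 QBST sections
pools = [
    {"member": "DB2A", "pool": "BP0", "buffers": 50000, "frame_size": "4KB"},
    {"member": "DB2B", "pool": "BP0", "buffers": 50000, "frame_size": "4KB"},
]

# Write one CSV row per (member, pool) pair, with a header row
with open("bufferpools.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["member", "pool", "buffers", "frame_size"])
    writer.writeheader()
    writer.writerows(pools)
```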

Let me give one very recent example, which the customer might recognise but which doesn’t expose them: they have two machines and each application group has a pair of LPARs, one on each machine. On each of these LPARs there is a Db2 subsystem. Each LPAR’s Db2 subsystem has identical buffer pool setups – which are different from other applications’ Db2s.

Db2 Statistics Trace IFCID 230 gives a similar view for whole Datasharing groups. Here, of course, the distinction is between groups, rather than within a group.


IDAA

IDAA is IBM’s hardware accelerator for queries, coming in two flavours:

  • Stand-alone, based on System P.
  • Using System Z IFLs.


Motivation

The purpose of IDAA servers is to speed up SQL queries (and, I suppose, to offload some CPU). Therefore I would like to know if a Db2 subsystem uses an IDAA server. Also whether Db2 subsystems share one.

IDAA is becoming increasingly common so sensitivity to the theme is topical.

Technique Outline

Db2 Statistics Trace IFCID 2 has a section Q8ST which describes the IDAA servers a Db2 subsystem is connected to. (These are variable length sections so, perhaps unhelpfully, the SMF triplet that describes them has 0 for length – but there is a technique for navigating them.)

A few notes:

  • The field Q8STTATE describes whether the IDAA server is online to the Db2 subsystem.
  • The field Q8STCORS is said to be core count but really you have to divide by 4 (the SMT threads per core) to get a credible core count – and hence model.
  • There can be multiple servers per physical machine, but we don’t have a machine serial number in Statistics Trace to tie the servers on the same machine together. However, some fields behave as if they are one per machine, rather than one per server, so we might be able to deduce which servers are on which machine. An example is Q8STDSKA – which also helps distinguish between generations (eg 48TB vs 81TB).
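The Q8STCORS adjustment as a one-liner (a sketch; the divide-by-4 is the SMT thread count mentioned above, and the sample value is invented):

```python
def credible_core_count(q8stcors: int, threads_per_core: int = 4) -> int:
    # Q8STCORS appears to report SMT threads rather than cores, so
    # divide by threads per core to get a credible core count
    return q8stcors // threads_per_core

print(credible_core_count(96))  # 96 reported -> 24 cores
```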

Wrap Up

I’m sure there’s much more I can do with SMF, from a Db2 architecture point of view. So expect more posts eventually. Hence the “Part 1” in the title. And, I think it’s going to be a lot of fun continuing to explore Db2 SMF in this way.

And, of course, I’m going to keep doing the same thing for non-Db2 infrastructure.

One other note: I seem to be biased towards “names in frames” rather than traffic at the moment. The sources of data do indeed allow analysis of eg “major user” rather than “minor user”. This is particularly relevant here in the case of IDAA. One should be conscious of “uses heavily” versus “is connected to but hardly uses at all”. That story can be told from the data.

Making Of

I’m continuing with the idea that the “making of” might be interesting (as I said in Coupling Facility Structure Versions). It occurs to me it might show people that you don’t have to have a perfect time and place to write. That might be encouraging for budding writers. But it might be stating “the bleedin’ obvious”. 🙂

This one was largely written on a flight to Toronto, for a customer workshop. Again, it poured out of my head and the structure naturally emerged. There might be a pattern here. 🙂

As an aside, I’m writing this on a 2021 12.9” iPad Pro – using Drafts. I’m in Economy – as always. I’m not finding it difficult to wield the iPad, complete with Magic Keyboard, in Economy Class. I’m certain I would find my 16” MacBook Pro cumbersome in the extreme.

And, of course, there was tinkering after I got home, just before publishing (but after a period of reflection).

Coupling Facility Structure Versions

When I see an 8-byte field in a record I think of three possibilities, but I’m prepared to discover the field in question is none of them. The three prime possibilities are:

  1. A character field
  2. A 64-bit counter
  3. A STCK value

An interesting case occurs in SMF 74 Subtype 4: Two similar fields – R744SVER and R744QVER – are described as structure versions.

Their values are structure-specific. Their description is terse (as is often the case). By the way that’s not much of a criticism; one would need to write War And Peace to properly describe a record. I guess I’m doing that, one blog post at a time. 🙂

Some Detective Work

With such a field the first thing you do is get the hex(adecimal) representations of some sample contents. In my case using REXX’s c2x function. Here’s an example of R744SVER: D95FCC96 70EEB410.
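In Python that first step might look like this (purely illustrative; my actual code uses REXX’s c2x):

```python
# Eight raw bytes as they might arrive from an SMF record
raw = bytes([0xD9, 0x5F, 0xCC, 0x96, 0x70, 0xEE, 0xB4, 0x10])

# Hex representation of the field's contents
print(raw.hex().upper())  # D95FCC9670EEB410
```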

A Character Field?

While not foolproof, it would be hard to mistake an EBCDIC string’s hex values for anything else. And vice versa. (Likewise ASCII, as it happens.) I think you’ll agree very few of the bytes in the above example look like printable EBCDIC characters.

These fields look nothing like EBCDIC.

A Counter?

I would expect most counters to not be close to exhausting the field’s range. So I would expect the top bits to not be set. Our above example is close to wrapping.

While these values tend to have something like ‘Dx’ for the top byte they don’t look like “unsaturated” counters.

So they’re not likely to be counters.

A STCK Value?

I put some sample values into a STCK formatter on the web. I got credible values – dates in 2020, 2021, and 2022.

For the example above I get “07-Mar-2021 06:34:00” – which is a very believable date.

So this seems like the best guess by far.
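As a sketch of the conversion (in Python rather than my REXX, and assuming the standard TOD clock format: epoch 1900-01-01, with bit 51 ticking once per microsecond):

```python
from datetime import datetime, timedelta

STCK_EPOCH = datetime(1900, 1, 1)

def stck_to_datetime(hex_string: str) -> datetime:
    # Shifting the 64-bit TOD value right by 12 bits leaves
    # microseconds since the 1900 epoch
    microseconds = int(hex_string, 16) >> 12
    return STCK_EPOCH + timedelta(microseconds=microseconds)

# The R744SVER example from above
dt = stck_to_datetime("D95FCC9670EEB410")
print(dt.strftime("%d-%b-%Y %H:%M:%S"))  # 07-Mar-2021 06:34:00
```

Which matches what the web formatter gave me.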

How Do We Interpret This Timestamp?

If we accept these fields are timestamps how do we interpret them?

My view is that this timestamp represents when the structure was allocated, possibly for the first time but more likely a reallocation. (And I can’t see which of these it is.)

Why might this happen?

I can think of a few reasons:

  • To move the structure to a different coupling facility. This might be a recovery action.
  • To restart the coupling facility. This might be to upgrade to a later CFLEVEL. Or indeed a new machine generation.
  • To resize the structure. This is a little subtle: I wouldn’t think, in general, you would reallocate to resize unless you were having to raise the structure’s maximum size.

One thing I’m not sure about is whether there is a time zone offset from GMT. I guess we’ll see what appears credible. I will say that hours and minutes are slightly less important in this than dates. I’m definitely seeing what looks like application-oriented changes such as MQ shared message queue structures appearing to pop into existence.


Conclusion

Guessing field formats is fun, though it is far from foolproof.

I’m a little tentative about this. As with many such things I want to see how customers react to me presenting these dates and times. Call it “gaining experience”.

But I do think this is going to be a useful technique – so I’ve built it into my tabular reporting that lists structures.

As always, more on this when I have something to share.

Making Of

I’m experimenting with the idea that somebody might be interested in how this blog post was made.

The original idea came from a perusal of the SMF 74-4 manual section. It was written in chunks, largely on one day. Two short train journeys, two short tube journeys and a theatre interval yielded the material. It seemed to pour out of my head, and the structure very naturally emerged. Then a little bit of finishing was required – including researching links – a couple of weeks later.

Mainframe Performance Topics Podcast Episode 31 “Take It To The Macs”

This is the first blog post I’ve written on my new work MacBook Pro. While it’s been a lot of work moving over, it’s a better place: it’s an Apple Silicon M1 Max machine with lots of memory and disk space.

That’s nice, but what’s the relevance to podcasting?

Well, it’s very warm here in the UK right now and I’ve been on video calls for hours on end. Yes, the machine gets warm – but possibly not from its load. But, importantly, there has been zero fan noise.

Fan noise has been the bedevilment of recording audio. Hopefully that era is now over – and just maybe the era of better sound quality in my recordings is upon us. (See also the not-so-secret Aftershow for this episode.)

As usual, Episode 31 was a lot of fun to make. I hope you enjoy it!

Episode 31 “Take it to the Macs” long show notes.

This episode is about our After Show. (What is that?)

Since our last episode, we were both in person at SHARE in Dallas, TX.

What’s New

  • More news on the CustomPac ServerPac removal date, which has been extended past January 2022. The CustomPac (ISPF) ServerPac removal date
    from Shopz for all ServerPacs will be July 10, 2022. Make sure you order before that date if you want a non-z/OSMF ServerPac. CBPDO is still available and unaffected.

  • The Data Set File System, which we talked about in Episode 30, has been released: APAR OA62150 closed April 28th, 2022, on z/OS V2.5 only.

  • IBM z16 – lots of great topics; we will do more on this in future episodes.

  • IBM z/OS Requirements have moved into the aha! tool, and they are now called Ideas.

Mainframe – z/OS Management Services Catalogs: Importance of z/OSMF Workflows

  • z/OS Management Services Catalog (zMSC) allows you to customize a z/OSMF
    Workflow for your enterprise, and publish it in a catalog for others to “click and use”.

    • zMSC Services can be very useful, as you can encode a specific installation’s standards into a Service.

    • As you can guess, there are different roles for these zMSC Services: Administrators and Users.

      • Administrators are those who can customize and publish a Service (from a z/OSMF Workflow definition file), and allow Users to run it.
    • To get you started, IBM provides 7 sample Services which are common tasks that you might want to review and publish. These samples are:

      1. Delete an alias from a catalog
      2. Create a zFS file system
      3. Expand a zFS file system
      4. Mount a zFS file system
      5. Unmount a zFS file system
      6. Replace an SMP/E RECEIVE ORDER certificate
      7. Delete a RACF user ID
    • More are likely to be added, based on feedback.

    • Note, however, that someone could add their own Services from z/OSMF Workflows. The Workflows could come from:

      • The popular Open Source zorow repository.

      • Created from your own ecosystem, perhaps even using the z/OSMF Workflow Editor to help you create it.

    • zMSC Services are based on z/OSMF Workflows, so you can see why knowing z/OSMF Workflows is important.

    • Customers can take Workflows and make them into Services, providing more checking and control than a z/OSMF Workflow alone can. Published Services can also be run again and
      again, meaning the tasks of Workflow creation, assignment, and acceptance are not necessary each time.

    • Without z/OSMF Workflows none of zMSC is usable, so get your Workflows ready to make appropriate ones into Services.

Performance – System Recovery Boost (SRB) Early Experiences

  • System Recovery Boost provides boosts of two kinds:

    • Speed Boost – which is useful for those with subcapacity servers, making them full speed during the boost. It won’t apply to full-speed customers.

    • zIIP Boost – which allows work that is normally not allowed to run on a zIIP to run on a zIIP.

      • You can purchase temporary zIIP capacity if you like.
  • There are basically three major stages to the SRB function:

    1. Those on the IBM z15, to reduce outage time:

      • Shutdown – which allows you to have 30 minutes’ worth of boosting during shutdown. This function must be requested each time it is to be used.

      • IPL – which allows you to have 60 minutes’ worth of boosting during IPL. This function is on by default.

    2. Additional functions for Recovery Process Boost, also provided on IBM z15. These extend boosting to, for instance, structure or connectivity recovery.

    3. Newer additional functions for Recovery Process Boost, specifically on IBM z16, for stopping and starting certain middleware.

  • Martin has had several early field experiences, which he has summarised in four blog posts:

    1. Really Starting Something

    2. SRB And SMF

    3. Third Time’s The Charm For SRB – Or Is it?

    4. SRB And Shutdown – from which Martin has noticed that Shutdown boosts might not be used as much as they could be.

  • It is important to know that SRB new function APARs have been released, all with the SMP/E FIXCAT of IBM.Function.SystemRecoveryBoost.
    Some of these functions may or may not be available back on the IBM z15.

  • Martin’s SRB conclusions are:

    • “Not one and done”. We’ve seen repeated updates to this technology, and it’s great to see it continuing to expand!

    • Good idea to run a small implementation project. Know what kind of advantage you are receiving from this function, which probably entails doing a “before” and “after” comparison.

    • Pay attention to your zIIP pool weights. An LPAR undergoing a boost might use a lot of zIIP, so make sure other LPARs have adequate zIIP pool weights to protect them.

    • For Shutdown, consider automation. This ensures the Shutdown boost doesn’t get left unused.

    • Take advantage of the available monitoring for effective usage.

  • Tell us of your SRB experience!

Topics – Stickiness

  • This Topics section explores what makes some technologies sticky, and some not, which Martin started in one of his blog posts. We almost went with this as the podcast episode title.

  • Martin and Marna discuss some of the attributes that are important for continuing to be used, and what makes a function fall away over time.

    • Value – There needs to be a balance between making your life better and the (somewhat) financial. Important points are productivity, reliability, and the value of doing something that is otherwise hard to do. Familiarity also has value.

    • Completeness – What features are there, and what’s missing? An example is Shortcuts, which has added a lot of functions over time. It can be a journey, with lots of competitors.

    • Usability and immediacy – An unsuccessful attempt was Martin’s numeric keypad, where you couldn’t tell what the keys were for without some fumbling. Stream Deck was programmable and helped by showing what the keys were for.

    • Reliability – How infrequently must it fail for it to be acceptable? 1%? 10%? It depends.

    • Setup complexity – Most people want things to be simple to set up. Martin likes to tailor capability; Marna likes it to be easy.

Out and about

  • Marna and Martin are both planning on being at SHARE in Columbus, August 22–26, 2022.

  • Martin will be talking about zIIP Capacity & Performance, with a revised presentation. Marna has a lot of sessions and labs, as usual – including the new z/OS on IBM z16!

On the blog

So It Goes

WLM-Managed Initiators And zIIP

One item in the z/OS 2.5 announcement caught my eye, and now that 2.5 is becoming more prevalent it’s worth talking about: zIIP and WLM-Managed Initiators.

WLM-Managed Initiators

The purpose of WLM-Managed Initiators is to balance system conditions against batch job initiation needs:

  • Start too many initiators and you can cause CPU thrashing.
  • Start too few and jobs will wait for an excessive period to get an initiator.

And this capability can be used both to delay job initiation and to choose where to start an initiator.

Prior to z/OS 2.5, General Purpose Processor (GCP) capacity would be taken into account but zIIP capacity wouldn’t be. With z/OS 2.5, zIIP is also taken into account.
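Conceptually – and this is an illustrative sketch of the idea, not WLM’s actual algorithm – the initiator decision now weighs both pools:

```python
def should_start_initiator(mpl_delay_high: bool,
                           gcp_headroom_pct: float,
                           ziip_headroom_pct: float,
                           min_headroom_pct: float = 10.0) -> bool:
    """Illustrative only: start another initiator when jobs are queueing
    and BOTH processor pools have spare capacity. Prior to z/OS 2.5 the
    zIIP headroom check would not have been part of the picture."""
    if not mpl_delay_high:
        # No jobs waiting for initiation; nothing to do
        return False
    return (gcp_headroom_pct >= min_headroom_pct
            and ziip_headroom_pct >= min_headroom_pct)
```

The point of the sketch is the extra condition: a system with plenty of GCP headroom but a saturated zIIP pool is no longer an attractive place to start a zIIP-hungry initiator.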

What WLM Knows About

But this raises a question: How does WLM know how zIIP intensive a job will be – before it’s even started?

Well, WLM isn’t clairvoyant. It doesn’t know the proclivities of an individual job before it starts. In fact it doesn’t know anything about individual job names. It can’t say, for instance, “job FRED always burns a lot of GCP CPU”.

So let’s review what WLM actually does know:

  • It knows initiation delays – at the Service Class level. This shows up as MPL Delay.1
  • It knows the usual components of velocity – again, at the Service Class level. (For example GCP Delay and GCP Using.)
  • It knows system conditions. And now zIIP can be taken into account.
  • It knows – at the Service Class level – resource consumption by a typical job. And this now extends to zIIP.

How Prevalent Is zIIP In Batch?

zIIP is becoming increasingly prevalent in the Batch Window, often in quite an intense manner. Examples of drivers include:

  • Java Batch
  • Db2 Utilities
  • A competitive sort product

When we2 look at customer systems we often see times of the night where zIIP usage is very high. (Often we’re not even focusing on Batch but see it out of the corner of our eye.)

(Actually this usage tends to be quite spiky. For example, Utilities windows tend to be of short duration but very intensive.)

So, it’s worth looking at the zIIP pool for the batch window to understand this.
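A minimal sketch of the kind of scan I mean, assuming you’ve already extracted per-interval zIIP pool busy figures (the data here is made up purely for illustration):

```python
# Hypothetical (interval-start, zIIP-pool-busy-%) pairs for an overnight window
intervals = [
    ("00:00", 12.0), ("00:15", 15.0), ("00:30", 88.0),
    ("00:45", 92.0), ("01:00", 18.0), ("01:15", 10.0),
]

THRESHOLD = 80.0  # flag intervals where the zIIP pool is heavily used

# The short, intense bursts the text describes show up as isolated
# intervals above the threshold
spikes = [(t, busy) for t, busy in intervals if busy >= THRESHOLD]
```

In this made-up data the 00:30 and 00:45 intervals would be flagged – the spiky, short-duration pattern typical of, say, a Utilities window.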

(I’ll say, in passing, often coupling facility structures are intensively accessed in similar, sometimes contemporaneous, bursts. As well as GCP and memory.)

I’m labouring the point because this trend of zIIP intensiveness in parts of the batch window might be a bit of a surprise.


If we accept WLM will now manage initiators’ placement (in system terms) and starting (in timing terms) with regard to zIIP we probably should classify jobs to service classes accordingly.

It’s been suggested zIIP-intensive jobs should be in different service classes from non-zIIP ones. With the possible exception of Utilities jobs I don’t think this is realistic. (Is Java batch businesswise different from the rest?) But if you can achieve it without much distortion of your batch architecture, WLM will take zIIP into account better in z/OS 2.5. One reason why you might not be able to do this is if the zIIP-using jobs are multi-step and only some of the steps are zIIP-intensive.

  1. Not to be confused with MPL Delay for WLM-Managed Db2 Stored Procedures Server address spaces, which is generally more serious. Metapoint: It pays to know what a service class is for. 

  2. Teamly “we”, not “Royal We”. :-) 

SRB And Shutdown

I’ve written several times about System Recovery Boost (SRB) so I’ll try to make this one a quick one.

For reference, previous posts were:

From that last one’s title it clearly wasn’t (the end of the matter). It’s worth reading the table with timestamps again.

Notice in particular that the first interval – the last one before shutdown – is not boosted. In fact I noted that in the post.

Some months on I think I now understand why – and I think it’s quite general.

To enable SRB at all you have to enable it in PARMLIB. But it is the default – so you’d have to actively disable it if the z15 (or now z16) support is installed. (One customer has told me they’ve actually done that.)

But enablement isn’t the same as actually invoking it:

  • For IPL you don’t have to do anything. You get 60 minutes’ boost automatically.
  • For shutdown you have to explicitly start the boost period – using the IEASDBS procedure.

What I think is happening is installations have SRB enabled but don’t invoke IEASDBS to initiate shutdown.

I would evolve shutdown operating procedures to include running IEASDBS. In general, I think SRB (and RPB, for that matter) would benefit from careful planning. So consider running a mini project when installing z15 or z16. If you’re already on z15 note there are enhancements in this area for z16. I also like that SRB / RPB is continuing to evolve. It’s not a “one and done”.

By the way there’s a nice Redpiece on SRB: Introducing IBM Z System Recovery Boost. It’s well worth a read.

In parting, I should confess I haven’t established how CPU-intensive shutdown and IPL are, nor how parallel they are. Perhaps that’s something I should investigate in the future. If I draw worthwhile conclusions I might well write about them here.