When Good Service Definitions Turn Bad

I was about to comment that it’s been a while since I wrote about WLM but, in researching this post, I discovered it hasn’t been that long. The last post was WLM-Managed Initiators And zIIP.

I seem to be telling a lot of customers their WLM service definition really needs some maintenance. In fact it’s every customer I’ve worked with over the past few years. You might say “well, it’s your job to analyse WLM for customers”. To some extent that’s true. However, my motivation is customer technical health rather than meeting some Services quota. (I don’t have one, thankfully.) So, if I say it I mean it.

I thought I’d explore why WLM service definition maintenance is important. Also when to do it.

Why Review Your Service Definition?

Service definitions have two main components:

  • Classification rules
  • Goals

Both yield reasons for review.

Classification Rules

I often see work misclassified. Examples include

  • Work in SYSSTC that shouldn’t be – such as Db2 Engine address spaces.
  • Work that should be Importance 1 but isn’t – again Db2 Engine but also MQ Queue Managers.
  • Work that’s new and not adequately classified. Recall, as an example, that if you don’t set a default for started tasks they are classified to SYSSTC. For batch the default would be SYSOTHER.

So, classification rules are worth examining occasionally.

Goals

Goal values can become out of date for a number of reasons. Common ones are:

  • Transactions have become faster
    • Due to tuning
    • Due to technology improvements
  • Business requirements have changed. For example, orchestrating simple transactions into ever more complex flows.
  • Talking of business requirements, it might become possible to do better than your Service Level Agreement says (or Service Level Expectation, if that’s what you really have).
  • A change from using I/O Priority Management to not.

Goal types can also become inappropriate. A good example of this is recent changes to how Db2 DDF High Performance DBATs are reported to WLM, as I articulate in my 2022 “Db2 and WLM” presentation.

But why would goal values matter? Take the case where you do far better than your goal. I often see this and I explain that if performance deteriorated to the point where the work just met goal people would probably complain. They get used to what you deliver, even if it’s far better than the SLA. And with a goal that is too lax this is almost bound to happen some time.

Conversely, an unrealistically tight goal isn’t helpful; WLM will give up (at least temporarily) on a thoroughly unachievable goal. Again misery.
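
To make “too lax” versus “unachievably tight” concrete, here’s a minimal sketch of the Performance Index (PI) arithmetic for the two simplest goal types – velocity and average response time – with invented numbers. (Percentile response time goals are more involved and omitted here.)

```python
def pi_velocity(goal_velocity: float, achieved_velocity: float) -> float:
    """PI for a velocity goal: goal / achieved. PI < 1 beats the goal, PI > 1 misses it."""
    return goal_velocity / achieved_velocity

def pi_avg_response_time(goal_seconds: float, achieved_seconds: float) -> float:
    """PI for an average response time goal: achieved / goal."""
    return achieved_seconds / goal_seconds

# A velocity goal of 30 when the work routinely achieves 70 is too lax:
print(round(pi_velocity(30, 70), 2))       # 0.43 - users get used to far better than the goal
# A 0.5 second goal when the work actually takes 2 seconds is unachievably tight:
print(pi_avg_response_time(0.5, 2.0))      # 4.0 - WLM may (temporarily) give up on it
```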

So, goal values and even types are worth examining occasionally – to make sure they’re appropriate and achievable but not too easy.

Period Durations

When I examine multi-period work (typically DDF or Batch) I often see almost all work ending in Period 1. I would hope to see something more than 90% ending in Period 1, but often it’s well above 99%. This implies, rhetorically, there is no heavy stuff. But I think I would want to protect against heavy stuff – so a higher-than-99%-in-Period-1 scenario is not ideal.
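
The check itself is trivial once you’ve summarised ended-transaction counts by period (for example from SMF 72-3). A minimal sketch with invented counts:

```python
def pct_ending_in_period_1(ended_by_period: list[int]) -> float:
    """Percentage of ended transactions that completed in Period 1."""
    total = sum(ended_by_period)
    return 100.0 * ended_by_period[0] / total if total else 0.0

# Invented counts for Periods 1, 2, 3: well above 99% ends in Period 1,
# which suggests the Period 1 duration is too generous.
print(pct_ending_in_period_1([995_000, 4_000, 1_000]))  # 99.5
```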

So, occasionally check on period durations for multi-period work. Similarly, think about whether the service class should be multi-period. (CICS transaction service classes, by the way, can’t be multi-period.)

When Should You Review Your Service Definition?

It’s worth some kind of review every year or so, performance data in hand. It’s also worth it whenever a significant technology change happens. It might be when you tune Db2’s buffer pools, or maybe when you get faster disks or upgrade your processor. All of these can change what’s achievable.

In the aftermath of a crisis is another good time. If you establish your service definition didn’t adequately protect what it should have then fixing that could well prevent a future crisis. Or at least ameliorate it. (I’m biased towards crises, by the way, as that’s what often gets me involved – usually in the aftermath and only occasionally while it’s happening.)

And Finally

Since I started writing this post I’ve “desk checked” a customer’s WLM Service Definition. (I’ve used my Open Source sd2html code to examine the XML unloaded from the WLM ISPF Application.)

I didn’t expect to – and usually I’d also have RMF SMF.

I won’t tell you what I told the customer but I will say there were quite a few things I could share (and that I had to write new function in sd2html to do so).

One day I will get the SMF and I’ll be able to do things like check goal appropriateness and period durations.

But, to repeat what I started with, every WLM Service Definition needs periodic maintenance. Yours probably does right now.

And, as a parting shot here’s a graph I generated from a table sd2html produces:

It shows year-level statistics for modifications to a different customer’s WLM Service Definition. As you can see, activity comes in waves. Practically, that’s probably true for most customers. So when’s the next wave due?

Heading Back Into Db2 – Architecture Part 1

I loftily talk about “architecture” a lot. What I’m really getting at is gleaning an understanding of an installation’s components – hardware and software – and some appreciation of what they’re for, as well as how they behave.

When I started doing Performance and Capacity – many years ago – I was less sensitive to the uses to which the machines were put. In fact, I’d argue “mainstream” Performance and Capacity doesn’t really encourage much understanding of what I call architecture.

To be fair, the techniques for gleaning architectural insight haven’t been well developed. Much more has been written and spoken about how to tune things.

Don’t get me wrong, I love tuning things. But my origin story is about something else: Perhaps programming, certainly tinkering. Doing stuff with SMF satisfies the former (though I have other projects to scratch that itch). Tinkering, though, takes me closer to use cases.

Why Db2?

What’s this got to do with Db2?

First, I should say I’ve been pretending to know Db2 for almost 30 years. 🙂 I used to tune Db2 – but then we got team mates who actually did tune Db2. And I never lost my affinity for Db2, but I got out of practice. And the tools I was using got rusty, some of them not working at all now.

I’m heading back into Db2 because I know there is an interesting story to tell from an architectural point of view. Essentially, one could morph tuning into asking a simple question: “What is this Db2 infrastructure for and how well suited is the configuration to that purpose?” That question allows us to see the components, their interrelationships, their performance characteristics, and aspects of resilience.

So let me give you two chunks of thinking, and I’ll try to give you a little motivation for each:

  • Buffer pools
  • IDAA

I am, of course, talking mainly about Db2 SMF. I have in the past also taken DSNZPARM and Db2 Catalog from customers. I expect to do so again. (On the DSNZPARM question, Db2 Statistics Trace actually is a better counterpart – so that’s one I probably won’t bother asking customers to send.)

I’m experimenting with the structure in my two examples. For each I think two subsections are helpful:

  • Motivation
  • Technique Outline

If this structure is useful future posts might retain it.

Buffer Pools

Here we’re talking about both local pools (specific to each Db2 subsystem) and group buffer pools (shared by the whole datasharing group, but maybe differentially accessed).

Motivation

Some Datasharing environments comprise Db2 subsystems that look identical. If you see one of these you hope the work processed by each Db2 member (subsystem) in the group is meant to be the same. The idea here is that the subsystems together provide resilience for the workload. If the Db2 subsystems don’t look identical you hope it’s because they’re processing different kinds of work (despite sharing the data).

I think that distinction is useful for architectural discussions.

More rarely, whole Datasharing groups might be expected to resemble each other. For example, if a parallel sysplex is a backup for another (or else shares a partitioned portion of the workload). Again, a useful architectural fact to find (or not find).

Technique Outline

Db2 Statistics Trace IFCID 202 data gives a lot of useful information about individual buffer pools – at the subsystem level. In particular the QBST section gives:

  • Buffer pool sizes
  • Buffer pool thresholds – whether read or write or for parallelism
  • Page frame sizes

At the moment I’m creating CSV files for each of these, and trialling them with each customer I work with. I’m finding cases where different members are set up differently – often radically. And also some where cloning is evident. From SMF I don’t think I’m going to see what the partitioning scheme is across clones – though some skew in terms of traffic might help tell the story.
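
Here’s a minimal sketch of the kind of cross-member comparison I mean, assuming the per-pool attributes have already been parsed out of the IFCID 202 QBST sections into one dictionary per member. The member names, attribute keys, and values are invented for illustration; they’re not the real QBST field names.

```python
import csv

# Invented sample data: one dictionary per Db2 member, keyed by buffer pool name.
members = {
    "DBP1": {"BP0": {"size": 50_000, "frames": "4K"}, "BP1": {"size": 200_000, "frames": "1M"}},
    "DBP2": {"BP0": {"size": 50_000, "frames": "4K"}, "BP1": {"size": 100_000, "frames": "4K"}},
}

pools = sorted({bp for defs in members.values() for bp in defs})

with open("bufferpools.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Pool"] + [f"{m} size/frames" for m in members])
    for bp in pools:
        cells = [members[m].get(bp, {}) for m in members]
        writer.writerow([bp] + [f'{c.get("size", "-")}/{c.get("frames", "-")}' for c in cells])
        # Any pool defined differently across members is evidence against cloning.
        if len({str(c) for c in cells}) > 1:
            print(f"{bp}: members differ")  # here BP1 differs; BP0 looks cloned
```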

Let me give one very recent example, which the customer might recognise but which doesn’t expose them: They have two machines and each application group has a pair of LPARs, one on each machine. On each of these LPARs there is a Db2 subsystem. Each LPAR’s Db2 subsystem has identical buffer pool setups – which are different from other applications’ Db2 subsystems.

Db2 Statistics Trace IFCID 230 gives a similar view for whole Datasharing groups. Here, of course, the distinction is between groups, rather than within a group.

IDAA

IDAA is IBM’s hardware accelerator for queries, coming in two flavours:

  • Stand-alone, based on System P.
  • Using System Z IFLs.

Motivation

The purpose of IDAA servers is to speed up SQL queries (and, I suppose, to offload some CPU). Therefore I would like to know if a Db2 subsystem uses an IDAA server. Also whether Db2 subsystems share one.

IDAA is becoming increasingly common so sensitivity to the theme is topical.

Technique Outline

Db2 Statistics Trace IFCID 2 has a section Q8ST which describes the IDAA servers a Db2 subsystem is connected to. (These are variable length sections so, perhaps unhelpfully, the SMF triplet that describes them has 0 for length – but there is a technique for navigating them.)

A few notes:

  • The field Q8STTATE describes whether the IDAA server is online to the Db2 subsystem.
  • The field Q8STCORS is said to be core count but really you have to divide by 4 (the SMT threads per core) to get a credible core count – and hence model.
  • There can be multiple servers per physical machine, but we don’t have a machine serial number in Statistics Trace to tie the servers on the same machine together. However, some fields behave as if they are one per machine, rather than one per server, so we might be able to deduce which servers are on which machine. For example Q8STDSKA – which also helps distinguish between generations (eg 48TB vs 81TB). (See the sketch after this list.)
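
To illustrate the last two bullets, here’s a minimal sketch with invented values: divide Q8STCORS by 4 for a credible core count, and group servers whose machine-level fields (Q8STDSKA here) behave identically as a hint that they share a physical machine. Treat the grouping as a deduction, not a fact.

```python
from collections import defaultdict

# Invented sample data: one entry per IDAA server as seen by a Db2 subsystem.
servers = [
    {"name": "ACCEL1", "Q8STCORS": 96, "Q8STDSKA": 81_000},
    {"name": "ACCEL2", "Q8STCORS": 96, "Q8STDSKA": 81_000},
    {"name": "ACCEL3", "Q8STCORS": 48, "Q8STDSKA": 48_000},
]

by_machine = defaultdict(list)
for s in servers:
    cores = s["Q8STCORS"] // 4          # 4 SMT threads per core
    by_machine[s["Q8STDSKA"]].append((s["name"], cores))

# Servers sharing the same machine-level value *might* be on the same physical machine.
for diska, group in by_machine.items():
    print(f"Q8STDSKA {diska}: {group}")
```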

Wrap Up

I’m sure there’s much more I can do with SMF, from a Db2 architecture point of view. So expect more posts eventually. Hence the “Part 1” in the title. And, I think it’s going to be a lot of fun continuing to explore Db2 SMF in this way.

And, of course, I’m going to keep doing the same thing for non-Db2 infrastructure.

One other note: I seem to be biased towards “names in frames” rather than traffic at the moment. The sources of data do indeed allow analysis of eg “major user” rather than “minor user”. This is particularly relevant here in the case of IDAA. One should be conscious of “uses heavily” versus “is connected to but hardly uses at all”. That story can be told from the data.

Making Of

I’m continuing with the idea that the “making of” might be interesting (as I said in Coupling Facility Structure Versions). It occurs to me it might show people that you don’t have to have the perfect time and place to write. That might be encouraging for budding writers. But it might be stating “the bleedin’ obvious”. 🙂

This one was largely written on a flight to Toronto, for a customer workshop. Again, it poured out of my head and the structure naturally emerged. There might be a pattern here. 🙂

As an aside, I’m writing this on a 2021 12.9” iPad Pro – using Drafts. I’m in Economy – as always. I’m not finding it difficult to wield the iPad, complete with Magic Keyboard, in Economy Class. I’m certain I would find my 16” MacBook Pro cumbersome in the extreme.

And, of course, there was tinkering after I got home, just before publishing (but after a period of reflection).

Coupling Facility Structure Versions

When I see an 8-byte field in a record I think of three possibilities, but I’m prepared to discover the field in question is none of them. The three prime possibilities are:

  1. A character field
  2. A 64-bit counter
  3. A STCK value

An interesting case occurs in SMF 74 Subtype 4: Two similar fields – R744SVER and R744QVER – are described as structure versions.

Their values are structure-specific. Their description is terse (as is often the case). By the way that’s not much of a criticism; One would need to write War And Peace to properly describe a record. I guess I’m doing that, one blog post at a time. 🙂

Some Detective Work

With such a field the first thing you do is get the hex(adecimal) representations of some sample contents. In my case using REXX’s c2x function. Here’s an example of R744SVER: D95FCC96 70EEB410.

A Character Field?

While not foolproof, it would be hard to mistake an EBCDIC string’s hex values for anything else. And vice versa. (Likewise ASCII, as it happens.) I think you’ll agree very few of the bytes in the above example look like printable EBCDIC characters.

These fields look nothing like EBCDIC.

A Counter?

I would expect most counters to not be close to exhausting the field’s range. So I would expect the top bits to not be set. Our above example is close to wrapping.

While these values tend to have something like ‘2x’ for the top byte they don’t look like “unsaturated” counters.

So they’re not likely to be counters.

A STCK Value?

I put some sample values into a STCK formatter on the web. I got credible values – dates in 2020, 2021, and 2022.

For the example above I get “07-Mar-2021 06:34:00 ” – which is a very believable date.
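
Here’s a minimal sketch of that conversion, using the standard TOD clock convention that bit 51 represents one microsecond and the epoch is 1900-01-01 00:00 UTC (leap seconds ignored):

```python
from datetime import datetime, timedelta

def stck_to_datetime(stck_hex: str) -> datetime:
    """Convert an 8-byte STCK value (as hex) to a UTC datetime.
    Shifting right 12 bits leaves microseconds since 1900-01-01."""
    microseconds = int(stck_hex.replace(" ", ""), 16) >> 12
    return datetime(1900, 1, 1) + timedelta(microseconds=microseconds)

print(stck_to_datetime("D95FCC96 70EEB410"))  # 2021-03-07 06:34:00.510187
```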

So this seems like the best guess by far.

How Do We Interpret This Timestamp?

If we accept these fields are timestamps how do we interpret them?

My view is that this timestamp represents when the structure was allocated, possibly for the first time but more likely a reallocation. (And I can’t see which of these it is.)

Why might this happen?

I can think of a few reasons:

  • To move the structure to a different coupling facility. This might be a recovery action.
  • To restart the coupling facility. This might be to upgrade to a later CFLEVEL. Or indeed a new machine generation.
  • To resize the structure. This is a little subtle: I wouldn’t think, in general, you would reallocate to resize unless you were having to raise the structure’s maximum size.

One thing I’m not sure about is whether there is a time zone offset from GMT. I guess we’ll see what appears credible. I will say that hours and minutes are slightly less important in this than dates. I’m definitely seeing what looks like application-oriented changes such as MQ shared message queue structures appearing to pop into existence.

Conclusion

Guessing field formats is fun, though it is far from foolproof.

I’m a little tentative about this. As with many such things I want to see how customers react to me presenting these dates and times. Call it “gaining experience”.

But I do think this is going to be a useful technique – so I’ve built it into my tabular reporting that lists structures.

As always, more on this when I have something to share.

Making Of

I’m experimenting with the idea that somebody might be interested in how this blog post was made.

The original idea came from a perusal of the SMF 74-4 manual section. It was written in chunks, largely on one day. Two short train journeys, two short tube journeys and a theatre interval yielded the material. It seemed to pour out of my head, and the structure very naturally emerged. Then a little bit of finishing was required – including researching links – a couple of weeks later.

Mainframe Performance Topics Podcast Episode 31 “Take It To The Macs”

This is the first blog post I’ve written on my new work MacBook Pro. While it’s been a lot of work moving over, it’s a better place as it’s an Apple Silicon M1 Max machine with lots of memory and disk space.

That’s nice, but what’s the relevance to podcasting?

Well, it’s very warm here in the UK right now and I’ve been on video calls for hours on end. Yes, the machine gets warm – but possibly not from its load. But, importantly, there has been zero fan noise.

Fan noise has been the bedevilment of recording audio. Hopefully that era is now over – and just maybe the era of better sound quality in my recordings is upon us. (See also the not-so-secret Aftershow for this episode.)

As usual, Episode 31 was a lot of fun to make. I hope you enjoy it!

Episode 31 “Take it to the Macs” long show notes.

This episode is about our After Show. (What is that?)

Since our last episode, we were both in person at SHARE in Dallas, TX.

What’s New

  • More new news for the CustomPac ServerPac removal date, which has been extended past January 2022. The CustomPac (ISPF) ServerPac removal date from Shopz for all ServerPacs will be July 10, 2022. Make sure you order before that date if you want a non-z/OSMF ServerPac. CBPDO is still available and unaffected.

  • Data Set File System has been released: APAR OA62150 closed April 28th, 2022, on z/OS V2.5 only. We talked about it in Episode 30.

  • IBM z16 – lots of great topics we will do on this in future episodes.

  • IBM z/OS Requirements have moved into the aha! tool, and they are called Ideas.

Mainframe – z/OS Management Services Catalogs: Importance of z/OSMF Workflows

  • z/OS Management Services Catalog, zMSC, allows you to customize a z/OSMF
    Workflow for your enterprise, and publish it in a catalog for others to “click and use”.

    • zMSC Services can be very useful, as you can encode specific installation’s standards into a Service.

    • As you can guess, there are different roles for these zMSC Services: Administrators and Users.

      • Administrators are those who can customize and publish a Service (from a z/OSMF Workflow definition file), and allow Users to run it.
    • To get you started, IBM provides 7 sample Services which are common tasks that you might want to review and publish. These samples are:

      1. Delete an alias from a catalog
      2. Create a zFS file system
      3. Expand a zFS file system
      4. Mount a zFS file system
      5. Unmount a zFS file system
      6. Replace an SMP/E RECEIVE ORDER certificate
      7. Delete a RACF user ID
    • More are likely to be added, based on feedback.

    • Note, however, someone could add their own from a z/OSMF Workflow. The z/OSMF Workflows could come from:

      • The popular Open Source zorow repository.

      • Created from your own ecosystem, perhaps even using the z/OSMF Workflow Editor to help you create it.

    • zMSC Services are based on z/OSMF Workflows. You can see why the discussion on knowing z/OSMF Workflows is important.

    • Customers can grab workflows and make them Services, providing more checking and control than a z/OSMF Workflow alone can do. They can also be run again and again from published Services, meaning that the tasks of Workflow creation, assignment, and acceptance are not necessary.

    • Without z/OSMF Workflows none of zMSC is usable, so get your Workflows ready to make appropriate ones into Services.

Performance – System Recovery Boost (SRB) Early Experiences

  • System Recovery Boost
    provides boosts of two kinds:

    • Speed Boost – which is useful for those with subcapacity servers to make them full speed. Won’t apply to full speed customers.

    • zIIP Boost – which allows work normally not allowed to run on a zIIP, to run on a zIIP.

      • You can purchase temporary zIIP capacity if you like.
  • There are basically three major stages to the SRB function:

    1. Those on the IBM z15, to reduce outage time:

      • Shutdown – which allows you to have 30 minutes worth of boosting during shutdown. This function must be requested to be used each time.

      • IPL – which allows you to have 60 minutes worth of boosting during IPL. This function is on by default.

    2. Additional functions for Recovery Process Boost, provided on IBM z15. Extends to structure or connectivity recovery, for instance.

    3. Newer additional functions for Recovery Process Boost, specifically on IBM z16, for stopping and starting certain middleware.

  • Martin has had several early field experiences, which he has summarised in four blog posts:

    1. Really Starting Something

    2. SRB And SMF

    3. Third Time’s The Charm For SRB – Or Is it?

    4. SRB And Shutdown – Martin has noticed that Shutdown boosts might not be used as much.

  • It is important to know that SRB new function APARs have been released, and all have the SMP/E FIXCAT of IBM.Function.SystemRecoveryBoost.
    Some of these functions may or may not go back to the IBM z15.

  • Martin’s SRB conclusions are:

    • “Not one and done”. We’ve seen updates to this technology, which is a great thing to see expanding!

    • Good idea to run a small implementation project. Know what kind of advantage you are receiving from this function, which probably entails doing a “before” and “after” comparison.

    • Pay attention to your zIIP Pool Weights. An LPAR undergoing a boost might use a lot of zIIP; Make sure other LPARs have adequate zIIP pool weights to protect them.

    • For Shutdown consider automation. This allows you to leave no SRB offering behind.

    • Take advantage of the available monitoring for effective usage.

  • Tell us of your SRB experience!

Topics – Stickiness

  • This topic explores what makes some technologies sticky, and some not, which Martin started in one of his blog posts. We almost went with this as the podcast episode title.

  • Martin and Marna discuss some of the attributes that are important for continuing to be used, and what makes a function fall away over time.

    • Value – There needs to be a balance between making your life better and (somewhat) financial value. Important points are productivity, reliability, and value in doing something that is hard to do. Familiarity also has nice value.

    • Completeness – What features are there and what’s missing. An example of this is Shortcuts, which has added a lot of functions over time. It can be a journey, and it has lots of competitors.

    • Usability and immediacy – An unsuccessful attempt was Martin’s numeric keypad, where you couldn’t know what the keys were for without some fumbling. StreamDeck was programmable and helped by showing what the keys were for.

    • Reliability – How infrequently must it fail for it to be acceptable? 1%? 10%? It depends.

    • Setup complexity – Most people want them simple to set up. Martin likes to tailor capability. Marna likes it to be easy.

Out and about

  • Marna and Martin are both planning on being in SHARE, Columbus, August 22–26, 2022.

  • Martin will be talking about zIIP Capacity & Performance, with a revised presentation. Marna has a lot of sessions and labs, as usual – including the new z/OS on IBM z16!

On the blog

So It Goes

WLM-Managed Initiators And zIIP

One item in the z/OS 2.5 announcement caught my eye. Now that 2.5 is becoming more prevalent it’s worth talking about: zIIP and WLM-Managed Initiators.

WLM-Managed Initiators

The purpose of WLM-Managed Initiators is to balance system conditions against batch job initiation needs:

  • Start too many initiators and you can cause CPU thrashing.
  • Start too few and jobs will wait for an excessive period to get an initiator.

And this can be used both to delay job initiation and to choose where to start an initiator.

Prior to z/OS 2.5 General-Purpose Processor (GCP) capacity would be taken into account but zIIP capacity wouldn’t be. With z/OS 2.5 zIIP is also taken into account.

What WLM Knows About

But this raises a question: How does WLM know how zIIP intensive a job will be – before it’s even started?

Well, WLM isn’t clairvoyant. It doesn’t know the proclivities of an individual job before it starts. In fact it doesn’t know anything about individual job names. It can’t say, for instance, “job FRED always burns a lot of GCP CPU”.

So let’s review what WLM actually does know:

  • It knows initiation delays – at the Service Class level. This shows up as MPL Delay.1
  • It knows the usual components of velocity – again, at the Service Class level. (For example GCP Delay and GCP Using – see the sketch after this list.)
  • It knows system conditions. And now zIIP can be taken into account.
  • It knows – at the Service Class level – resource consumption by a typical job. And this now extends to zIIP.
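
As a reminder of the arithmetic behind the velocity components bullet, velocity is just Using samples as a fraction of Using-plus-Delay samples. A minimal sketch with invented sample counts:

```python
def velocity(using_samples: int, delay_samples: int) -> float:
    """Execution velocity: 100 * Using / (Using + Delay)."""
    total = using_samples + delay_samples
    return 100.0 * using_samples / total if total else 0.0

# Invented numbers: 400 Using samples against 600 Delay samples
# (the Delay samples include MPL Delay, GCP Delay, and so on).
print(velocity(400, 600))  # 40.0
```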

How Prevalent Is zIIP In Batch?

zIIP is becoming increasingly prevalent in the Batch Window, often in quite an intense manner. Examples of drivers include:

  • Java Batch
  • Db2 Utilities
  • A competitive sort product

When we2 look at customer systems we often see times of the night where zIIP usage is very high. (Often we’re not even focusing on Batch but see it out of the corner of our eye.)

(Actually this usage tends to be quite spiky. For example, Utilities windows tend to be of short duration but very intensive.)

So, it’s worth looking at the zIIP pool for the batch window to understand this.

(I’ll say, in passing, often coupling facility structures are intensively accessed in similar, sometimes contemporaneous, bursts. As well as GCP and memory.)

I’m labouring the point because this trend of zIIP intensiveness in parts of the batch window might be a bit of a surprise.

Conclusion

If we accept WLM will now manage initiators’ placement (in system terms) and starting (in timing terms) with regard to zIIP we probably should classify jobs to service classes accordingly.

It’s suggested zIIP jobs should be in different service classes to non-zIIP ones. With the possible exception of Utilities jobs I don’t think this is realistic. (Is Java batch businesswise different from the rest?) But if you can achieve it without much distortion of your batch architecture WLM will take zIIP into account better in z/OS 2.5. One reason why you might not be able to do this is if the zIIP-using jobs are multi-step and only some of the steps are zIIP-intensive.


  1. Not to be confused with MPL Delay for WLM-Managed Db2 Stored Procedures Server address spaces, which is generally more serious. Metapoint: It pays to know what a service class is for. 

  2. Teamly “we”, not “Royal We”. :-) 

SRB And Shutdown

I’ve written several times about System Recovery Boost (SRB) so I’ll try to make this one a quick one.

For reference, previous posts were:

  • Really Starting Something
  • SRB And SMF
  • Third Time’s The Charm For SRB – Or Is it?

From that last one’s title it clearly wasn’t (the end of the matter). It’s worth reading the table with timestamps again.

Notice in particular the first interval – the last one before shutdown – is not boosted. In fact I note that fact in the post.

Some months on I think I now understand why – and I think it’s quite general.

To enable SRB at all you have to enable it in PARMLIB. But it is the default – so you’d have to actively disable it if the z15 (or now z16) support is installed. (One customer has told me they’ve actually done that.)

But enablement isn’t the same as actually invoking it:

  • For IPL you don’t have to do anything. You get 60 minutes’ boost automatically.
  • For shutdown you have to explicitly start the boost period – using the IEASDBS procedure.

What I think is happening is installations have SRB enabled but don’t invoke IEASDBS to initiate shutdown.

I would evolve shutdown operating procedures to include running IEASDBS. In general, I think SRB (and RPB, for that matter) would benefit from careful planning. So consider running a mini project when installing z15 or z16. If you’re already on z15 note there are enhancements in this area for z16. I also like that SRB / RPB is continuing to evolve. It’s not a “one and done”.

By the way there’s a nice Redpiece on SRB: Introducing IBM Z System Recovery Boost. It’s well worth a read.

In parting, I should confess I haven’t established how CPU intensive shutdown and IPL are, more how parallel. Perhaps that’s something I should investigate in the future. If I draw worthwhile conclusions I might well write about them here.

Engineering – Part Six – Defined Capacity Capping Considered Harmful?

For quite a while now I’ve been able to do useful CPU analysis down at the individual logical processor level. In fact this post follows on from Engineering – Part Five – z14 IOPs – at a discreet distance.

I can’t believe I haven’t written about Defined Capacity Capping before – but apparently I haven’t.

As you probably know such capping generally works by introducing a “phantom weight”. This holds the capped LPAR down – by restricting it to below its normal share (of the GCP pool). Speaking of GCPs, this is a purely GCP mechanism and so I’ll keep it simple(r) by only discussing GCPs.

But have you ever wondered how RMF (or PR/SM for that matter) accounts for this phantom weight?

Well, I have and I recently got some insight by looking at engine-level GCP data. Processing at the interval and engine level yields some interesting insights.

But let me first review the data I’m using. There are three SMF record types I have to hand:

  • 70-1 (RMF CPU Activity)
  • 99-14 (Processor Topology)
  • 113 (HIS Counters)

I am working with a customer with 8 Production mainframes (a mixture of z14 and z15 multi-drawer models). Most of them have at least one z/OS LPAR that hits a Defined Capacity cap – generally early mornings across the week’s data they’ve sent.

None of these machines is terribly busy. And none of them are even close to having all physical cores characterised.

Vertical Weights

In most cases the LPARs only have Vertical High (VH) logical GCPs. I can calculate what the weight is for a VH as it’s a whole physical GCP’s worth of weight: Divide the total pool weight by the total number of physical processors in the pool. For example, if the LPARs’ weights for the pool add up to 1000 and there are 5 physical GCPs in the pool a physical GCP’s worth of weight is 200 – and so that’s the polar weight of a VH logical GCP. (And is directly observable as such.)
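
Here’s that arithmetic as a minimal sketch, plus a simplified classification of a logical processor’s polarisation from its polar weight (the numbers are the ones from the example above, with invented per-logical-GCP weights):

```python
def full_engine_weight(total_pool_weight: float, physical_gcps: int) -> float:
    """The weight equivalent to one whole physical GCP in the pool."""
    return total_pool_weight / physical_gcps

def polarisation(polar_weight: float, engine_weight: float) -> str:
    """Simplified classification of a logical GCP from its polar weight."""
    if polar_weight >= engine_weight:
        return "VH"   # entitled to a whole physical processor
    if polar_weight > 0:
        return "VM"   # entitled to part of a physical processor
    return "VL"       # no entitlement

engine = full_engine_weight(1000, 5)      # 200, as in the example above
for weight in (200, 200, 120, 0):         # invented polar weights while capped
    print(weight, polarisation(weight, engine))
```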

Now here’s how the logical processors are behaving:

  • When not capped all the logical processors have a full processor’s weight (as expected).
  • When capped weights move somewhat from higher-numbered logical GCPs to lower-numbered ones.

The consequence is some of the higher numbered ones become Vertical Lows (VLs) and occasionally a VH turns into a Vertical Medium (VM). What I’ve also observed is the remaining VH’s get polar weights above a full engine’s weight – which they obviously can’t entirely use.

And we know all this from SMF 70 Subtype 1 records, summarised in each RMF interval at the logical processor level.

Logical Core Home Addresses

But what are the implications of Defined Capacity capping?

Obviously the LPAR’s access to GCP CPU is restricted – which is the intent. And, almost as obviously, some workloads are likely to be hit. You probably don’t need a lecture from me on the especial importance of having WLM set up right so the important work is protected under such circumstances. Actually, this post isn’t about that.

There are other consequences of being capped in this way. And this is really what this post is about.

When a logical processor changes polarisation PR/SM often reworks what are deemed “Home Addresses” for the logical processors:

  • For VH logical processors the logical processor is always dispatched on the same physical processor – which is its home address.
  • A VM logical processor isn’t entitled to a whole physical processor’s worth of weight. It has, potentially, to share with other logical processors. But it still has a home address. It’s just that there’s a looser correspondence between home address and where the VM is dispatched in the machine.
  • A VL logical processor has an even looser correspondence between its home address and where it is dispatched. (Indeed it has no entitlement to be dispatched at all.)

What I’ve observed – using SMF 99 Subtype 14 records – follows. But first I would encourage you to collect 99-14 as they are inexpensive. Also SMF 113, but we’ll come to that.

When SMF 70-1 says the LPAR is capped (and the weights shift, as previously described) the following happens: Some higher-numbered logical GCPs move home addresses – according to SMF 99-14. But, in my case, these are VL’s. So their home addresses are less meaningful.

In one case, and I don’t have an explanation for this, hitting the cap caused the whole LPAR to move drawers. And it moved back again when the cap was removed.

If the concept of a home address is less meaningful for a VL, why do we care that it’s moved? Actually, we don’t. We care about something else…

… From SMF 113 it’s observed that Cycles Per Instruction (CPI) deteriorates. Usually one measures this across all logical processors, or all logical processors in a pool. In the cases I’m describing these measures deteriorate. But there is some fine structure to this. In fact it’s not that fine…

… The logical processors that turned from VH to VL experience CPI values that move from the reasonable 3 or so to several hundred. This suggests to me these VL logical processors are being dispatched remote from where the data is. You could read that as “remote from where the rest of the LPAR is”. There might also be a second effect of being dispatched to cores with effectively empty local caches (Levels 1 and 2). Note: Cache contents won’t move with the logical processor as it gets redispatched somewhere else.
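
For reference, CPI itself is a very simple calculation from the cycle and instruction counts in the SMF 113 basic counter set, done here per logical processor per interval with invented counts:

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles Per Instruction for one logical processor over one RMF interval."""
    return cycles / instructions if instructions else float("nan")

# Invented counts: a healthy VH versus a demoted VL running far from its data.
print(round(cpi(3_000_000_000, 1_000_000_000), 1))  # 3.0
print(round(cpi(3_000_000_000, 8_000_000), 1))      # 375.0
```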

So the CPI deterioration factor is real and can be significant when the LPAR is capped.

Conclusion

There are two main conclusions:

  • Defined Capacity actual capping can have consequences – in terms of Cycles Per Instruction (CPI).
  • There is value in using SMF 70-1, 99-14, and 113 to understand what happens when an LPAR is Defined Capacity capped. And especially analysing the data at the individual logical processor level.

By the way, I haven’t mentioned Group Capping. I would expect it to be similar – as the mechanism is.

Mainframe Performance Topics Podcast Episode 30 “Choices Choices”

It’s been a mighty long time since Marna and I got a podcast episode out – and actually we started planning this episode long ago. It’s finding good times to record that does it to us, as planning can be a bit more asynchronous.

Hopefully this delay has enabled some of you to catch up with the series. Generally the topics stand the test of time, not being awfully time critical.

And I was very pleased to be able to record a topic with Scott Ballentine. It’s about a presentation we wrote – which we synopsise. I think this synopsis could prove inspiring to some people – whether they be customers or product developers.

With luck this might well be the last episode we record where I have to worry about fan noise. Being able to dispense with noise reduction might help my voice along a little, too. 🙂

The episode can be found here. The whole series can, of course, be found here or from wherever you get your podcasts.

Episode 30 “Choices Choices” Show notes

This episode is about our Topics Topics on choosing the right programming language for the job. We have a special guest joining us for the performance topic, Scott Ballentine.

Since our last episode, we were virtually at GSE UK, and IBM TechU. Martin also visited some Swedish customers.

What’s New

  • New news for the CustomPac removal date, which has been extended past January 2022. The reason was to accommodate the desired Data Set Merge capability in z/OSMF which customers needed. z/OSMF Software Management will deliver the support in PH42048. Once that support is there, ServerPac can exploit it. The new withdrawal date is planned to be announced in 2Q2022.

  • Check out the LinkedIn article on the IBM server changing for FTPS users for software electronic delivery on April 30, 2021, from using TLS 1.0 and 1.1 to using TLS 1.2, with a dependency on AT-TLS.

    • If you are using HTTPS, you are not affected – and HTTPS is recommended.

Mainframe – Only in V2.5

  • Let’s note: z/OS 2.3 EOS is September 2022, and z/OS 2.4 has not been orderable since the end of January 2022.

  • This topic was looking at some new functions that are only in z/OS V2.5. We wouldn’t necessarily expect anything beyond this point to be rolled back into V2.4.

    • Data Set File System, planned to be available in 1Q 2022.

      • Allows access from z/OS UNIX to MVS sequential or partitioned data sets that have certain formats.

      • Must be cataloged. Data set names are case insensitive.

      • Popular use cases would be looking at the system log after it has been saved in SDSF, editing with vi, and data set transfers with sftp.

      • Also will be useful with Ansible and DevOps tooling.

      • Serialization and security is just as if it was being accessed via ISPF.

      • There are mapping rules that you’ll need to understand. The path will begin with /dsfs.

    • Dynamic Change Master Catalog, yes, without an IPL

      • Must have a valid new master catalog to switch to

      • More so, you can put a comment on the command now

      • Helpful if you wanted to remove IMBED or REPLICATE and you haven’t been able to because it would have meant an outage.

    • RACF data base encryption has a statement of direction.

    • For scalability:

      • Increase of the z/OS memory limit above 4TB, to 16TB, with only 2GB frames used above 4TB real. Good examples to exploit this are Java and zCX.

      • More Concurrently “Open” VSAM Linear Datasets. Db2 exploits this with APAR PH09189, and APAR PH33238 is suggested.

        • Each data set is represented by several internal z/OS data areas which reside in below the bar storage.

        • This support moves both VSAM and allocation data areas above the bar to reduce the storage usage in the below the bar storage area.

        • The support is optional. Control is via ALLOCxx’s SYSTEM SWBSTORAGE setting: SWA will cause SWBs to be placed in 31-bit storage, as they have been in prior releases, while ATB will cause SWBs to be eligible to be placed in 64-bit storage.

        • Can be changed dynamically and which option you are using can be displayed.

      • Notable user requirements included:

        • ISPF – updates in support of PDSE V2 member generations, and the SUBMIT command adds an optional SUBSYS parameter.

        • Useful for directing jobs to the JES2 emergency subsystem
      • Access Method Services – IDCAMS – DELETE MASK has two new options TEST and EXCLUDE

        • TEST will return all the objects that would have been deleted if TEST wasn’t specified

        • EXCLUDE will allow a subset of objects that match the MASK to be excluded from those being deleted

        • Also, REPRO is enhanced to move its I/O buffers above the line to reduce the instances of out of space (878) ABENDs

    • z/OS Encryption Readiness Technology zERT

      • z/OS v2.5 adds support for detecting and responding to weak or questionable connections.

      • Policy based enforcement during TCP/IP connection establishment

        • Extending the Communications Server Policy Agent with new rules and actions

        • Detect weak application encryption and take action

        • Notification through messages and take action with your automation

        • Auditing via SMF records

        • Immediate Termination of connections is available through policy

  • There’s a lot of other stuff rolled back to V2.4

Performance – What’s the Use? – with special guest Scott Ballentine

  • This discussion is a summary from a joint presentation on Usage Data and IFAUSAGE

  • Useful for developers and for customers

  • The topic is motivational because customers can get a lot of value out of this usage data, and understand the provenance of IFAUSAGE data.

  • A macro that vendors – or anybody – can use to:

    • Show which products are used and how, including some numbers

    • Show names: Product Vendor, Name, ID, Version, Qualifier

    • Show numbers: Product TCB, Product SRB, FUNCTIONDATA

    • And let’s see how they turn into things you can use

  • The data is ostensibly for SCRT

    • Which is fed by SMF 70 and SMF 89

    • You might want to glean other value from IFAUSAGE

  • Scott talked about encoding via IFAUSAGE, which appears in SMF 30 and 89-1

    • SMF 89-1: Software Levels query, Db2 / MQ subsystems query

    • SMF 30: Topology (e.g. CICS connecting to Db2 or MQ), Some numbers (Connections’ CPU)

    • Both SMF 30 and 89

      • FUNCTIONDATA: You could count transactions, though we’re unsure of any IBM products using it. Db2 NO89 vs MQ Always On.

      • Slice the data a little differently with 30 vs 89
  • Some of these examples might inspire developers to think about how they code IFAUSAGE

    • Are your software levels coded right?
    • Do you use Product Qualifier creatively?
    • Do you fill in any of the numbers?
  • Have given the presentation four times

    • Technical University, October 2021

    • GSE UK Virtual Conference, November 2021

    • Nordics mainframe technical day

    • And our own internal teams which was meant to be a dry run, but actually was after the other three

  • It’s a living presentation which we could give at other venues, including to development teams.

    • Living also means it continues to evolve.
  • Hope is developers will delight customers by using IFAUSAGE right, and customers will take advantage in the way shown with reporting examples.

Topics – Choices, Choices

  • This topic is about how to choose a language to use for which purpose. Different languages were discussed for different needs.

  • Use Case: Serving HTML?

    • PHP / Apache on Mac localhost. Problem is to serve dynamically constructed HTML, which is used for Martin’s analysis.

    • PHP processes XML and can do file transfer and handle URL query strings. PHP 8 caused some rework. Fixes to sd2html for this.

    • Javascript / Node.js on Raspberry Pi. Good because plenty of ecosystem. Node also seems a moving target.

  • Host consideration: Running on e.g. laptop?

    • Python: Built-ins, for example CSV, Beautiful Soup, XML. However, Python 3 is incompatible with Python 2, and Python 3.8 has the nice “Walrus Operator”. Tabs and spaces can be irritating.

    • Automation tools: Keyboard Maestro, Shortcuts. On iPhone / iPad as well now as Mac OS.

    • Javascript in Drafts and OmniFocus. Cross platform programming models

    • AppleScript

  • Host consideration: Running on z/OS?

    • Assembler / DFSORT for high-volume data processing. Mapping macros shipped with many products.

    • REXX for everything else.

      • Martin uses it for orchestrating GDDM and SLR, to name two. As of z/OS 2.1 it can process SMF nicely.

      • Health checks. Marna finds REXX easy to use for Health Checks, with lots of good samples.

      • z/OSMF Workflows. Easy to run REXX from a Workflow.

  • Overall lesson: Choose the language that respects the problem at hand.

    • Orchestrates what you need to orchestrate, runs in the environment you need it to run in, has lots of samples, has sufficient linguistic expression, is sufficiently stable, and performs well enough.

    • In summary, just because you have a hammer not everything is a nail. Not every oyster contains a PERL.

Out and about

  • Both Martin and Marna will be at SHARE, 27 – 30 March, in Dallas.

    • Marna has 6 sessions, and highlights the BYOD z/OSMF labs.

    • Martin isn’t speaking, but will be busy.

On the blog

So it goes.

Third Time’s The Charm For SRB – Or Is it?

Passing reference to Blondes Have More Fun – Or Do They?.

Yeah, I know, it’s a tortuous link. 🙂 And, nah, I never did own that album. 🙂

I first wrote about System Recovery Boost (SRB) and Recovery Process Boost (RPB) in SRB And SMF. Let me quote one passage from it:

It should also be noted that when a boost period starts the current RMF interval stops and a new one is started. Likewise when it ends that interval stops and a new one is started. So you will get “short interval” SMF records around the boost period.

I thought in this post I’d illustrate that. So I ran a query to show what happens around an IPL boost. I think it’s informative.

RMF Interval Start | Interval Minutes | Interval Seconds | IPL Time (UTC) | IPL Time (Local) | zIIP Boost | Note
May 9 07:58:51 | 5:02 | 302 | May 2 07:13:08 | May 2 08:13:08 | No | 1 2 3
Down Time May 9 07:58:51 – 08:15:01
May 9 08:19:50 | 9:03 | 543 | May 9 07:15:01 | May 9 08:15:01 | Yes | 4
May 9 08:28:54 | 15:00 | 900 | May 9 07:15:01 | May 9 08:15:01 | Yes | 5
May 9 08:43:55 | 15:00 | 900 | May 9 07:15:01 | May 9 08:15:01 | Yes | 5
May 9 08:58:55 | 15:00 | 900 | May 9 07:15:01 | May 9 08:15:01 | Yes | 5
May 9 09:13:55 | 1:31 | 91 | May 9 07:15:01 | May 9 08:15:01 | Yes | 6
May 9 09:15:25 | 13:29 | 809 | May 9 07:15:01 | May 9 08:15:01 | No | 7
May 9 09:28:55 | 15:00 | 900 | May 9 07:15:01 | May 9 08:15:01 | No | 8

Some notes:

  1. UTC Offset is + 1 Hour i.e. British Summer Time.
  2. IPL time was a week before.
  3. For some reason there was no shutdown boost.
  4. This is a short interval and the first with the boost. And note RMF starts a few seconds after IPL.
  5. These are full-length intervals (900 seconds or 15 minutes) with the boost.
  6. This is a short final interval with the boost.
  7. This is a short interval without the boost – which I take to be a form of re-sync’ing.
  8. This is a return to full-length intervals.

So you can see the down time between RMF cutting its final record and the IPL. Also between the IPL and RMF starting. You can also see the short intervals around starting and stopping the boost period.
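
If you wanted to pick those short intervals out programmatically, a minimal sketch – using invented data that mirrors the table above, with interval lengths and boost flags as they’d come from SMF 70-1 – might look like this:

```python
EXPECTED_SECONDS = 900  # the regular 15-minute RMF interval

# (interval start, length in seconds, zIIP boost in effect?)
intervals = [
    ("08:19:50", 543, True), ("08:28:54", 900, True), ("08:43:55", 900, True),
    ("08:58:55", 900, True), ("09:13:55", 91, True), ("09:15:25", 809, False),
    ("09:28:55", 900, False),
]

for start, seconds, boosted in intervals:
    if seconds < EXPECTED_SECONDS:
        print(f"{start}: short interval ({seconds}s), boost={boosted} - boost state changed nearby")
```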

Here’s an experimental way of showing the short intervals and the regular (15 minute) intervals.

The blue intervals are within the boost period, the orange outside it.

I don’t know if the above is helpful, but I thought it worth a try.

I don’t know that this query forms the basis for my Production code, but it just might. And I remain convinced that zIIP boosts (and, to a lesser extent, speed boosts) are going to be a fact of life we are going to have to get used to.

Finally, I’ll also admit I’m still learning about how RMF intervals work – so this has been a useful exercise for me.

Of course, when I say “finally”, I only mean “finally for this post”. I’ve a sneaking suspicion I’ve more to learn. Ain’t that always the way? 🙂

Stickiness

Question: What’s brown and sticky?

Answer: A stick. 🙂

It’s not that kind of stickiness I’m talking about.

I’ve experimented with lots of technologies over the years – hardware, software, and services. Some of them have stuck and many of them haven’t.

I think it’s worth exploring what makes some technologies stick(y) and some not – based on personal experience, largely centered around personal automation.

So let’s look at some key elements, with examples where possible.

Value

The technology has to provide sufficient value at a sufficiently low cost. “Value” here doesn’t necessarily mean money; It has to make a big enough contribution to my life.

To be honest, value could include hobbying as opposed to utility. For example, Raspberry Pi gives me endless hours of fun.

But value, generally for me, is productivity, reliability, enhancement, and automation in general:

  • Productivity: Get more done.
  • Reliability: Do it with fewer errors than I would.
  • Enhancement: Do things I couldn’t do.
  • Automation: Take me out of the loop of doing the thing.

Completeness

If a technology is obviously missing key things I’ll be less likely to adopt it.

But there is value – to go with the irritation – in adopting something early. You have to look at the prospects for building out or refinement.

An example of this is Siri Shortcuts (neé Workflow). It started out with much less function than it has now. But the rate of enhancement in the early days was breathtaking; I just knew they’d get there.

And the value in early adoption includes having a chance to understand the later, more complex, version. I learn incrementally. A good example of this might be the real and virtual storage aspects of z/OS.

Also, the sooner I adopt the earlier I get up the learning curve and get value.

I’m beta’ing a few of my favourite apps and I’d be a hopeless beta tester for new function if I hadn’t got extensive experience of the app already.

Usability And Immediacy

A first attempt at push-button automation was using an external numeric keypad to automate editing podcast audio with Audacity.

The trouble with this is that you have to remember which button on the keypad does what. I fashioned a keyboard template but it wasn’t very good. (How do you handle the keys in the middle of the block?)

When I heard about StreamDeck I was attracted to the fact each key had an image and text on it. That gives immediate information about what the key does. I didn’t rework my Audacity automation to use it – as I coincidentally moved to Ferrite on iPad for my audio editing needs. But I built lots of new stuff using it.

So StreamDeck has usability a numeric keypad doesn’t. It’s also better than hot key combinations – which I do also rely on.

Reliability

What percent of the time does something have to fail for you to consider it unreliable? 1%? 10%?

I guess it depends on the irritation or damage factor:

  • If your car fails to start 1% of the time that’s really bad.
  • If “Ahoy telephone, change my watch face” fails 10% of the time that’s irritating but not much more.

The latter case is true of certain kinds of automation. But others are rock solid.

And, to my mind, Shortcuts is not reliable enough yet – particularly if the user base includes devices that aren’t right up to date. Time will tell.

Setup Complexity

I don’t know whether I like more setup complexity or less. 🙂 Most people, though, would prefer less. But I like tailorability and extensibility.

A good balance, though, is being easy to get going with but having a high degree of extensibility or tailorability.

Conclusion

I’m probably more likely to try new technologies than most – in some domains. But in others I’m probably less likely to. Specifically, those domains I’m less interested in anyway.

The above headings summarise the essentials of stickiness – so I won’t repeat them here.

I will say the really sticky things for me are:

  • Drafts – where much of my text really does start (including this blog post).
  • OmniFocus – my task manager, without which a lot of stuff wouldn’t get done.
  • StreamDeck for kicking stuff off.
  • Keyboard Maestro for Mac automation.
  • Apple Watch
    • for health, audio playback, text input (yes really), and automation (a little).
  • Overcast – as my podcast player of choice.
  • iThoughts – for drawing tree diagrams (and, I suppose, mind mapping) 🙂

You might notice I haven’t put Shortcuts on the list. It almost makes it – but I find its usability questionable – and now there are so many alternatives.

There is an element of “triumph of hope over experience” about all this – but there is quite a lot of stickiness: Many things – as the above list shows – actually stick.

It’s perhaps cruel to note two services that have come unstuck – and I can say why in a way that is relevant to this post:

  • Remember The Milk was my first task manager but it didn’t really evolve much – and it needed to to retain my loyalty.
  • Evernote was my first note taking app. They got a bit distracted – though some of their experiments were worthwhile. And again evolution wasn’t their forte.

I suppose these two illustrate another point: Nothing lasts forever; It’s possible my Early 2023 stickiness list will differ from my Early 2022 one.

One final thought: The attitude of a developer / supplier is tremendously important. It’s no surprise several of the sticky things have acquired stickiness with a very innovative and responsive attitude. I just hope I can display some of that in what I do.