Mainframe Performance Topics Podcast Episode 29 “Hello Trello”

This was a fun episode to make, not least because it featured Miroslava Barahona Rossi as a guest. She’s one of my mentees from Brasil and she’s very good.

We made it over a period of a few weeks, fitting around unusually busy “day job” schedules for Marna and me.

And this one is one of the longest we’ve done.

Anyhow, we hope you enjoy it. And that “commute length” can be meaningful to more of you very soon.

Episode 29 “Hello Trello” long show notes.

This episode is about how to be a better z/OS installation specialist, z/OS capture ratios, and a discussion on using Trello. We have a special guest joining us for the performance topic, Miroslava Barahona Rossi.

Follow up to Waze topic in Episode 8 in 2016

  • Apple Maps adds Accident, Hazard, and Speed Check reporting to iOS, using the iPhone, CarPlay, and Siri.

What’s New

  • Check out the LinkedIn article on the IBM server changing for FTPS users for software electronic delivery on April 30, 2021, from using TLS 1.0 and 1.1 to using TLS 1.2, with a dependency on AT-TLS.

  • If you are using HTTPS, you are not affected! This is recommended.

Mainframe – Being a Better Installation Specialist

  • This section was modeled after Martin’s “How to be a better Performance Specialist” presentation. It’s a personal view of some ideas to apply to the discipline.

  • “Better” here means lessons learned, not competing with people. “Installation Specialist” might just as well have used “System Programmer” or another term.

  • There’s definitely a reason to be optimistic about being a person who does that Installation Specialist or System Programmer type of work:

  • There will always be a need for someone to know how systems are put together, how they are configured, how they are upgraded, and how they are serviced…how to just get things to run.

    • And other people will want them to run faster and to have all the resources they need for whatever is thrown at them. Performance Specialists and Installation Specialists complement each other.
  • Consider the new function adoption needs too. There are so many “new” functions that are just waiting to be used.

    • We could put new functions into two categories:

      1. Make life easier: Health Checker, z/OSMF, SMP/E for service retrieval

      2. Enable other kinds of workload: Python, Jupyter notebooks, Docker containers

    • A good Installation Specialist would shine in identifying and enabling new function.

      • Now more than ever, with Continuous Delivery across the entire z/OS stack.

      • Beyond “change the process of upgrading”, into “driving new function usage”.

      • Although we know that some customers are just trying to stay current. Merely upgrading could bring new function activation out of the box, like System Recovery Boost (SRB) on z15.

    • A really good installation specialist would take information they have and use it in different ways.

      • Looking at the SMP/E CSIs with a simple program in C, for example, to find fixes installed between two dates.

      • Associating that with an incident that just happened, by narrowing it down to a specific element.

      • Using z/OSMF’s cross-GLOBAL zone query capability to quickly see whether a PE is present anywhere in the enterprise.

      • Knowing what the right tool is for the right need.

    • Knowing parmlib really well. Removing unused members and statements. Getting away from hard-coding defaults – which can be hard, but sometimes can be easier (because some components tell you if you are using defaults).

      • Using the DISPLAY command to immediately find necessary information.

      • Knowing the z/OS UNIX health check that can compare active values with the hardened parmlib member in use.

    • Researching End Of Service got immensely easier with z/OSMF.

    • Looking into the history of the systems – for example, evidence of two shops having merged.

      • That could leave two “sets” of parmlibs and proclibs lying around, which might be hard to track, depending on what you have, such as ISPF statistics or comments. Change management systems can help.

      • You might see it in LPAR names and CICS region naming conventions.

    • Might modern tools such as z/OSMF provide an opportunity to rework the way things are done?

      • Yes, but often more function might be needed to completely replace an existing tool…or not.
    • You can’t be better by only doing things yourself; no one can know everything. You’ve got to work with others who are specialists in their own area.

      • Performance folks often have to work with System Programmers, for instance. Storage, Security, Networking, Applications – the list goes on.

      • Examples are with zBNA, memory and zIIP controls for zCX, and estate planning.

    • Use your co-workers to learn from. And teach them what you know too. And in forums like IBM-MAIN, conferences, and user groups.

    • Last but not least, learn how to teach yourself. Know where to find answers (and that doesn’t mean asking people!). Learn how to try out something on a sandbox.

Performance – Capture Ratio

  • Our special guest is Miroslava Barahona Rossi, a technical expert who works with large Brasilian customers.

  • Capture Ratio – a ratio of workload CPU to system-level CPU as a percentage. Or, excluding operating system work from productive work.

    • RMF SMF 70-1 versus SMF 72-3
      • 70-1 is CPU at system and machine level
      • 72-3 is workload / service class / report class level
      • Not in an RMF report
  • Why isn’t the capture ratio 100%?

    • There wouldn’t be a fair way to attribute some kinds of CPU. For example I/O Interrupt handling.
  • Why do we care about capture ratio?

    • Commercial considerations when billing for uncaptured cycles. You might worry something is wrong if the capture ratio is low.

    • Might be an opportunity for tuning if below, say, 80%

  • What is a reasonable value?

    • Usually 80 – 90%. We’re seeing more like 85 – 95% these days. It has improved because more of the I/O-related CPU is captured.

    • People worry about low capture ratios.

    • Also, work is less I/O intensive – for example, because we buffer better.

  • The zIIP capture ratio is generally higher than the GCP one.

  • Do we calculate blended GCP and zIIP? Yes, but also zIIP separately from GCP.

  • Why might a capture ratio be low?

    • Common: Low utilisation, Paging, High I/O rate.

    • Less common: Inefficient ACS routines, Fragmented storage pools, Account code verification, Affinity processing, Long internal queues, SLIP processing, GTF

  • Experiment correlating capture ratio with myriad things

    • One customer’s set of data, on z13, where the capture ratio varied significantly.

      • In a spreadsheet, calculated the correlation between capture ratio and various other metrics, using the =CORREL(range, range) Excel function (see the sketch at the end of this section).

      • Good correlation is > 85%

      • Eliminate potential causes, one by one:

        • Paging, SIIS: poor correlation

        • Low utilisation: strong correlation

      • It has nothing much to do with machine generation. The same customer – from z9 to z13 – always had a low capture ratio.

        • It got a little bit better with newer z/OS releases

        • Workload mix? Batch versus transactional

      • All the other potential causes eliminated

      • Turned out to be LPAR complexity

        • 10 LPARs on 3 engines. The logical to physical ratio was 7:1, which was rather extreme.

        • Nothing much can be done about it – you could merge LPARs, but that’s an architectural decision.

    • Lesson: worth doing it with your own data. Experiment with excluding data and various potential correlations.

  • Correlation is not causation. Look for the real mechanism, and eliminate causes one by one. Probably start with paging and low utilisation

  • Other kinds of Capture Ratio

    • Coupling Facility CPU: always 100%, though at low traffic the CPU per request is inflated.

    • For CICS, SMF 30 versus SMF 110 Monitor Trace: Difference is management of the region on behalf of the transactions.

    • Think of a CICS region as running a small operating system. Recording SMF 110 isn’t scalable, so generally this capture ratio is not tracked.

  • Summary

    • Don’t be upset if you get a capture ratio substantially lower than 100%. That’s normal.

    • Understand your normal. Be aware of the relationship of your normal to everybody else’s. But, be careful when making that comparison as it is very workload dependent.

    • Understand your data and causes. See if you can find a way of improving it. Keep track of the capture ratio over the months and years.
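
For those who like to do this with their own data, here is a minimal sketch of both the capture ratio calculation and the correlation experiment, assuming you have already summarised the SMF 70-1 and 72-3 data into per-interval totals. The column names and numbers are illustrative, not actual SMF field names.

# Capture ratio and correlation sketch – illustrative column names and made-up numbers.
import pandas as pd

def capture_ratio(captured_cpu, total_cpu):
    # Captured (workload-level, 72-3) CPU as a percentage of total (system-level, 70-1) CPU.
    return 100.0 * captured_cpu / total_cpu

# One row per RMF interval, already summarised from SMF 70-1 and 72-3.
intervals = pd.DataFrame({
    "total_cpu":    [520.0, 610.0, 480.0, 700.0],   # CPU seconds from 70-1
    "captured_cpu": [455.0, 550.0, 400.0, 640.0],   # summed over 72-3 service / report classes
    "busy_pct":     [62.0, 75.0, 55.0, 88.0],       # a candidate explanatory metric
})

intervals["capture_ratio"] = capture_ratio(intervals["captured_cpu"], intervals["total_cpu"])

# The equivalent of Excel's =CORREL(range, range): Pearson correlation.
print(intervals["capture_ratio"].corr(intervals["busy_pct"]))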

Topics – Hello Trello

  • Trello is based on the Kanban idea: boards, lists, and cards. Cards contain the data in paragraphs, checklists, pictures, etc.

  • You can move cards between lists by dragging.

  • Templates are handy, and are used in creating our podcast.

  • Power-ups add function. A popular one is Butler. Paying for them might be a key consideration.

  • Multiplatform plus web. Provided by Atlassian, which you might know from Confluence wiki software.

    • Atlassian also makes Jira, which is an Agile project management tool.
  • Why are we talking about Trello?

    • We moved to it for high level podcast planning

      • One list per episode. The list’s picture is what becomes the cover art.

      • Each card is a topic, except the first one in the list, which is our checklist.

    • Template used for hatching new future episodes.

      • But we still outline the topics with iThoughts.
    • We move cards around sometimes between episodes.

      • Recently more than ever, as we switched Episode 28 and 29, due to the z/OS V2.5 Preview Announce.

      • Right now planning two episodes at once.

  • Marna uses it to organise daily work, with personal workload management and calendaring. But it is not a meetings calendar.

    • Probably, with Jira, it will be more useful than ever. We’ll see.
  • Martin uses it for team engagements, with four lists: Potential, Active, Dormant, Completed.

    • Engagement moves between lists as it progresses

    • Debate between one board per engagement and one for everything. Went with one board for everything because otherwise micromanagement sets in.

    • GitHub projects, which are somewhat dormant now, because of…

    • Interoperability

      • GitHub issues vs OmniFocus tasks vs Trello cards. There is a Trello Power-Up for GitHub, which maps issues, branches, commits, and pull requests into cards.

      • However, it is quite fragile, as we are not sure changes in state are reliably reflected.

    • Three-legged stool is Martin’s problem, as he uses three tools to keep work in sync. Fragility in automation would be anybody’s problem.

      • iOS Shortcuts support is a well built out model. For example, it can create Trello cards and retrieve lists and cards (see the sketch at the end of this section).

        • Might be a way to keep the 3 above in sync
    • IFTTT is used by Marna for automation, and Martin uses automation that sends a Notification when someone updates one of the team’s cards.

      • Martin uses Pushcut on iOS – as we mentioned in Episode 27 Topics

      • Trello provides an IFTTT trigger, and Pushcut provides a Notification service, which can also kick off a shortcut.

      • We encountered some issues: each list needed its own IFTTT applet, and you can’t duplicate applets in IFTTT, so it’s a pain to track multiple Trello lists, even within a single board.

    • Automation might be a better alternative to Power-ups, as you can build them yourself.

  • Reflections:

    • Marna likes Trello. She uses it to be more productive, but would like a couple more functions, which might be added as it becomes more popular.

    • Martin likes Trello too, but with reservations.

      • Dragging cards around seems a little silly. There are more compact and richer ways of representing such things.

      • A bit of a waste of screen real estate, as cards aren’t that compact. Especially as lists are all in one row. It would be nice to be able to fill more of the screen – with two rows or a more flexible layout.
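
As a flavour of what driving Trello programmatically looks like (outside Shortcuts), here is a minimal sketch that creates a card using the Trello REST API. It assumes you have generated an API key and token and looked up the id of the target list; all three values, and the card text, are placeholders rather than anything from our boards.

# Minimal Trello card creation sketch – key, token, and list id are placeholders.
import json
import urllib.parse
import urllib.request

TRELLO_KEY = "<your API key>"
TRELLO_TOKEN = "<your token>"
LIST_ID = "<id of the target list>"

def create_card(name, description=""):
    # Build the query string the Trello REST API expects for POST /1/cards.
    params = urllib.parse.urlencode({
        "key": TRELLO_KEY,
        "token": TRELLO_TOKEN,
        "idList": LIST_ID,
        "name": name,
        "desc": description,
    })
    request = urllib.request.Request(
        "https://api.trello.com/1/cards?" + params, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

print(create_card("Episode 30 ideas", "Hatched from the template")["id"])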

In closing

  • GSE UK Virtual Conference will be 2 – 12 November 2021.

On the blog

So It Goes

Pi As A Protocol Converter

I wrote about Automation with my Raspberry Pi 3B in Raspberry Pi As An Automation Platform. It’s become a permanent fixture in my office and I’ve given it another task. This blog post is about that task.

Lots of things use JSON (JavaScript Object Notation) for communication via HTTP. Unfortunately they don’t all speak the same dialect. Actually:

  1. They do; It’s JSON pure and simple. Though some JSON processors are a little bit picky about things like quotes.
  2. It’s just as well there is the flexibility to express a diverse range of semantics.

This post is about an experiment to convert one form of JSON to another. When I say “experiment” it’s actually something I have in Production – as this post was born from solving a practical problem. I would view it as more a template to borrow from and massively tailor.

The Overall Problem

I have a number of GitHub repositories. With GitHub you can raise an Issue – to ask a question, log a bug, or suggest an enhancement. When that happens to one of my repositories I want to create a task in my task manager – OmniFocus. And I want to do it as automatically as possible.

There isn’t an API to do this directly, so I have to do it via a Shortcuts shortcut (sic) on iOS. To cause the shortcut to fire I use the most excellent PushCut app. PushCut can kick off a shortcut on receipt of a webhook (a custom URL) invocation.

Originally I used an interface between GitHub and IFTTT to cause IFTTT to invoke this webhook. This proved unreliable.

The overall problem, then, is to cause a new GitHub issue to invoke a PushCut webhook with the correct parameters.

The Technical Solution

I emphasised “with the correct parameters” because that’s where this gets interesting:

You can set GitHub up – on a repository-by-repository basis – to invoke a webhook when a new issue is raised. This webhook delivers a comprehensive JSON object.

PushCut webhooks expect JSON – but in a different format to what GitHub provides. And neither of these is tweakable enough to get the job done.

The solution is to create a “protocol converter”, which transforms the JSON from the GitHub format into the PushCut format. This I did with a Raspberry Pi. (I have several already so this was completely free for me to do.)

Implementation consisted of several steps:

  1. Install Apache web server and PHP on the Pi.
  2. Make that web server accessible from the Internet. (I’m not keen on this but I think it’s OK in this case – and it is necessary.)
  3. Write a script.
  4. Install it in the /var/www/html/ directory on the Pi.
  5. Set up the GitHub webhook to invoke the webhook at the address of the script on the Pi.

Only the PHP script is interesting. You can find how to do the rest on the web, so I won’t discuss those steps here.

PHP Sample Code

The following is just the PHP piece – with the eventual shortcut being a sideshow (so I haven’t included it).

<?php

$secret = "<mysecret>";
$json = file_get_contents('php://input');
$data = json_decode($json);

if($data->action == "opened"){
  $issue = $data->issue->number;
  $repository = $data->repository->name;
  $title = $data->issue->title;
  $url = $data->issue->html_url;

  $pushcutURL = "https://api.pushcut.io/" . $secret . "/notifications/New%20GitHub%20Issue%20Via%20Raspberry%20Pi";

  //The JSON data.
  $JSON = array(
      'title' => 'New Issue for ' . $repository,
      'text' => "$issue $title",
      'input' => "$repository $issue $url $title",
  );


  $context = stream_context_create(array(
    'http' => array(
      'method' => 'POST',
      'header' => "Content-Type: application/json\r\n",
      'content' => json_encode($JSON)
    )
  ));

  $response = file_get_contents($pushcutURL, FALSE, $context);
}

?>

But let me explain the more general pieces of the code.

  • Before you could even use it for connecting GitHub to PushCut you would need to replace <mysecret> with your own personal PushCut secret, of course.
  • $json = file_get_contents('php://input'); stores in a variable the JSON sent with the webhook. Let’s call this the “inbound JSON”.
  • The JSON gets decoded into a PHP data structure with $data = json_decode($json);.
  • The rest of the code only gets executed if $data->action is “opened” – as this code is only handling Open events for issues.
  • The line $pushcutURL = "https://api.pushcut.io/" . $secret . "/notifications/New%20GitHub%20Issue%20Via%20Raspberry%20Pi"; is composing the URL for the PushCut webhook. In particular note the notification name “New GitHub Issue Via Raspberry Pi” is percent encoded.
  • The outbound JSON has to be created using elements of the inbound JSON, plus some things PushCut wants – such as a title to display in a notification. In particular the value “input” is set to contain the repository name, the issue number, the original Issue’s URL, and the issue’s title. All except the last are single-word entities. If you are adapting this idea you need to make up your own convention.
  • The $context = and $response = lines are where the PushCut webhook is actually invoked.

As I said, treat the above as a template, with the general idea being that the PHP code can translate the JSON it’s invoked with into a form another service can use, and then call that service.
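
If you want to test a converter like this before pointing GitHub at it, something like the following is enough – it POSTs a hand-built, GitHub-style “issue opened” payload at the Pi. The hostname and script name are placeholders, and the payload carries only the fields the PHP script actually reads.

# Send a minimal GitHub-style "issue opened" payload to the converter on the Pi.
# The hostname and script name below are placeholders for wherever you installed it.
import json
import urllib.request

payload = {
    "action": "opened",
    "issue": {
        "number": 42,
        "title": "Test issue",
        "html_url": "https://github.com/example/repo/issues/42",
    },
    "repository": {"name": "repo"},
}

request = urllib.request.Request(
    "http://raspberrypi.local/converter.php",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.status)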

Conclusion

It was very straightforward to write a JSON converter in PHP. You could do this for any JSON conversion – which is actually why I thought it worthwhile to write it up.

I would also note you could do exactly the same in other software stacks, in particular Node.js. I will leave that one as an exercise for the interested reader. I don’t know whether that would be faster or easier for most people.

On the question of “faster” my need was “low volume” so I didn’t much care about speed. It was plenty fast enough for my needs – being almost instant – and very reliable.

One other thought: My example is JSON but it needn’t be. There need not even be an inbound or outbound payload. The idea of using a web server on a Pi to do translation is what I wanted to get across – with a little, not terribly difficult, sample code.

Mainframe Performance Topics Podcast Episode 28 “The Preview That We Do Anew”

(Originally posted 2 March, 2021.)

It’s unusual for us to publish a podcast episode with a specific deadline in mind. But we thought the z/OS 2.5 Preview announcement was something we could contribute to. So, here we are.

I also wanted to talk about some Open Source projects I’ve been contributing to. So that’s in there.

And it was nice to have Nick on to talk about zCX.

Lengthwise, it’s a “bumper edition”… 🙂

Episode 28 “The Preview That We Do Anew” long show notes.

  • This episode is about several of the z/OS V2.5 new functions, which were recently announced, for both the Mainframe and Performance topics. Our Topics topic is on Martin’s Open Source tool filterCSV.

  • We have a guest for Performance: Nick Matsakis, z/OS Development, IBM Poughkeepsie.

  • Many of the enhancements you’ll see in the z/OS V2.5 Preview were provided on earlier z/OS releases via Continuous Delivery PTFs. The APARs are provided in the announce.

What’s New

Mainframe – Selected z/OS V2.5 enhancements

  • We’ve divided up the Mainframe V2.5 items into two sections: installation and non-installation.

z/OS V2.5 Installation enhancements.

  • IBM will have z/OS installable with z/OSMF, in a portable software instance format!

  • z/OS V2.4 will not be installable with z/OSMF, and z/OS V2.4 driving system requirements remain the same.

  • z/OS V2.5 will be installable via z/OSMF, so that is a big driving system change.

    • However, there is a small window when z/OS V2.4 and z/OS V2.5 are concurrently orderable, in which z/OS V2.5 will have the same driving system requirements as z/OS V2.4. That overlapping window, when z/OS V2.5 is planned to be available via both the old (ISPF CustomPac Dialog) and new (z/OSMF) formats, is September 2021 through January 2022.

    • After that window, be aware! When z/OS V2.5 is the only orderable z/OS release, all IBM ServerPacs will have to be installed with z/OSMF.

    • All means CICS, Db2, IMS, MQ, and z/OS and all the program products.

    • To be prepared today for this change:

      • Get z/OSMF up and running on your driving system.

      • Learn z/OSMF Software Management (which is very intuitive) and try to install a portable software instance from this website.

    • This is a big step forward in the z/OS installation strategy that IBM and all the leading software vendors have been working years on.

      • John Eells came to this very podcast in Episode 9 to talk about it.
    • CICS, Db2, and IMS are already installable with a z/OSMF ServerPac. You can try those out right now.

    • CBPDO will remain an option, instead of ServerPac. But it is much harder to install.

      • ServerPac is much easier, and a z/OSMF ServerPac is easiest of all.

z/OS V2.5 Non-installation enhancements.

  • Notification of availability of TCP/IP extended services

    • For many operational tasks and applications that depend on z/OS TCP/IP communication services, the current message is insufficient.

    • New ENF event intended to enable applications with dependencies on TCP/IP extended services to initialise faster

  • Predictive Failure Analysis (PFA) has more checks

    • For above the bar private storage exhaustion, JES2 resource exhaustion, and performance degradation of key address spaces.
  • Workload Manager (WLM) batch initiator management takes into account availability of zIIP capacity

    • Works most effectively when customer has separate service classes for mostly-zIIP and mostly-GCP jobs

  • Catalog and IDCAMS enhancements

    • Catalog Address Space (CAS) restart functions are enhanced to allow you to change the Master Catalog without an IPL.

    • IDCAMS DELETE mask takes TEST and EXCLUDE. TEST shows what would be deleted using the mask. EXCLUDE is further filtering – beyond the mask.

    • IDCAMS REPRO moves I/O buffers above the line. This will help avoid 878 “Insufficient Virtual Storage” ABENDs. We think this might allow more buffers, and multitasking in one address space.

  • New RMF concept for CF data gathering

    • There is a new option, not the default, to optimize CF hardware data collection to one system. Remember SMF 74.4 has two types of data: system specific, and common to all systems.

    • This is designed to reduce overhead on the other n-1 systems.

  • RMF has been restructured, but all the functions are still intact. z/OS V2.5 RMF is still a priced feature.

    • A new z/OS V2.5 base element called “Data Gatherer” provides basic data gathering and is available to all, whether you’ve bought RMF or not. It will cut some SMF records.

    • There is a new z/OS V2.5 priced feature called “Advanced Data Gatherer”, which all RMF users are entitled to.

    • Marna is mentioning this because the restructure has brought about some one-time customization changes you’ll need to make to parmlib, for APF and linklist.

  • More, quite diverse, RACF health checks – for PassTickets, active subsystem address spaces, and sysplex configuration.

Performance – z/OS V2.5 zCX enhancements.

  • Our special guest is Nick Matsakis, who is a performance specialist in z/OS Development, and has worked on several components in the BCP (GRS, XCF/XES, …). Martin and Nick have known each other for many years, recalling Nick’s assignment in Hursley, UK.

  • zCX is a base element new in z/OS V2.4, and requires a z14. It allows you to run Linux on Z Docker container applications on z/OS.

  • zCX is important for co-locating Linux on Z containers with z/OS. You can look at zCX instances like appliances, which are z/OS address spaces.

  • Popular use cases can be found here and in the Redbook here. Another helpful source is Ready for the Cloud with IBM zCX.

    • Nick mentions the use cases of adding microservices to existing z/OS applications served by a zCX container, and the MQ Concentrator for reducing z/OS CPU costs by running it on zCX. Another is Aspera, which is good for streaming-type workloads.
  • zIIP eligibility enhancements

    • Context switching reduction was delivered, so you can typically expect about 95% offload to zIIP.
  • Memory enhancements

    • Originally it was all 4K fixed pages. New enhancements include support for 1 MB and 2 GB large pages (still fixed) for backing guests.

      • Increases efficiency of memory management, with better performance expected, mainly from TLB miss reduction.

      • In house, Nick saw improvements from 0.25% up to about 6 – 12%, depending on what you are running.

    • Note the need to set LFAREA, as discussed in Episode 26.

      • LFAREA as of z/OS V2.3 is the maximum number of fixed 1 MB pages allowed on the system. The 2 GB specification hasn’t changed.

      • zCX configuration allows you to say which page sizes you’d like to try. Plan for using 2GB.

    • Guest memory is planned to be configured up to 1 TB.

      • zCX uses fixed storage so the practical limit may be lower. The limit used to be much lower, at about 100 GB.

      • Now we support up to 1000 containers in a zCX address space. Capacity is increasing.

  • Another relief is in Disk space limits

    • The number of data and swap disks per appliance is planned to be increased to as many as 245. This is intended to enable a single zCX to address more data at one time.

    • Point is you can run more and larger containers.

  • Instrumentation enhanced

    • Monitor and log zCX resource usage of the root disk, guest memory, swap disk, and data disks in the server’s job log.

    • zCX resource shortage z/OS alerts are proactive alerts that are sent to the z/OS system log (SYSLOG) or operations log (OPERLOG) to improve monitoring and automated operations. The server monitors used memory, root disk space, user data disk space, and swap space in the zCX instance periodically and issues messages to the zCX joblog and operator console when the usage rises to 50%, 70%, and 85% utilization. When returning below 50%, an information message is issued

    • But still nothing in SMF to look inside a zCX address space

      • There is Docker-specific instrumentation that can provide that for you.
  • SIMD (or Vector)

    • SIMD is a performance feature, and can be used for analytics.

    • Some containers don’t check if they are running on hardware where SIMD is available.

  • Note that most of what’s in the z/OS 2.5 Preview for zCX is rolled back to z/OS 2.4 with APARs.

  • From this, we can conclude zCX wasn’t a “one and done”.

    • z/OS 2.5 might be a good time to try it. There is a 90-day trial period, as there is a cost for it. But, why wait for 2.5?
  • Nick’s presentation (with Mike Fitzpatrick) can be downloaded here.

Topics – filterCSV and tree manipulation

  • Trees consist of nodes that have zero to many children. You can have a leaf node (zero children) or a non-leaf node (one or more children).

    • Navigation can be recursive or iterative, which makes it nice for programming.
  • Mindmapping leads to trees. Thinking of z/OS: Sysplex -> System -> Db2 -> Connected CICS leads to trees. Also, in Db2 DDF Analysis Tool we show DDF connections as a tree.

  • Structurally, each node is a data structure with fields such as readable names. Each node has pointers to its children and maybe its parent. This gives it its “topology”, and tree levels.

  • iThoughts is a mind mapping tool, and displays a mind map as a tree. Nodes can have colours and shapes, and many other attributes besides.

    • iThoughts runs on Windows, iOS, iPadOS and macOS.

    • Exports and imports CSV files, with a tree topology and also node attributes, such as shape, colour, text, notes.

    • Has very little automation of its own. But crucially you can mangle the CSV file outside of iThoughts, which is what filterCSV does.

  • filterCSV is an open source Python program that manipulates iThoughts CSV files.

    • It can address the automation problem, as it does the mangling automatically.

    • An example: automatically colouring the blobs based on patterns (regular expressions). A sketch of the idea appears at the end of this section.

      • Colouring CICS regions according to naming conventions
  • filterCSV started simple, and Martin has kept adding function – most recently find and replace. As it’s an open source project, contributions are welcomed.
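
To make the tree idea concrete, here is a minimal sketch of the kind of recursive walk filterCSV does – colouring every node whose name matches a regular expression. The node class and colour values are illustrative; they are not filterCSV’s actual data structures.

# Recursive tree walk that colours nodes matching a regex – illustrative, not filterCSV's real classes.
import re

class Node:
    def __init__(self, name, colour=None):
        self.name = name
        self.colour = colour
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

def colour_matching(node, pattern, colour):
    # Depth-first walk: colour this node if its name matches, then recurse into the children.
    if re.search(pattern, node.name):
        node.colour = colour
    for child in node.children:
        colour_matching(child, pattern, colour)

# Sysplex -> System -> CICS regions, coloured by naming convention.
root = Node("PLEX1")
sys1 = root.add(Node("SYS1"))
sys1.add(Node("CICSPA01"))
sys1.add(Node("CICSPT01"))

colour_matching(root, r"^CICSPA", "red")   # e.g. colour one set of regions red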

On the blog

So It Goes

SMF 70-1 – Where Some More Of The Wild Things Are

(First posted February 21, 2021)

As I recall, the last time I wrote about SMF 70-1 records in detail was Engineering – Part Two – Non-Integer Weights Are A Thing. Even if it weren’t, no matter – as I’d like you to take a look at it. The reason is to reacquaint you with ERBSCAN and ERBSHOW – two invaluable tools when understanding the detailed structure and contents of an SMF record. (Really, an RMF SMF record.) And it does introduce you to the concept of a Logical Processor Data Section.

This post is another dive into detailed record structure. (The first attempt at the last sentence had the word “derailed”; That might tell me something.) 🙂

In most cases a system cuts a single SMF 70 Subtype 1 record per interval. But this post is not about those cases.

The Structure Of A 70-1 Record

SMF 70-1 is one of the more complex record subtypes – and one of the most valuable.

Here is a synopsis of the layout:

What is in blue(ish) relates to the system cutting the record. The other colours are for other LPARs.

At its simplest, a single 70-1 record represents all the LPARs on the machine. But it’s not always that simple.

Let me point out some key features.

  • The CPU Data Sections are 1 per processor for the system that cut the record. In this example there are three – so this is a 3-way.
  • zIIPs and GCPs are treated the same, but they are individually identifiable as zIIP or GCP.
  • There is one Partition Data Section per logical partition on the machine, plus 1 called “*PHYSCAL”.
  • There is one Logical Processor Data Section per logical processor, plus 1 per physical processor.

The colour coding is useful here. Let’s divide it into two cases:

  • The processors for the cutting LPAR.
  • The processors for the other LPARs.

For what we’ll call “this LPAR”, there are CPU Data Sections for each processor, plus a Partition Data Section, Logical Core Data Sections, and Logical Processor Data Sections.

For each of what we’ll call “other LPARs” there are just the Partition Data Section and its Logical Processor Data Sections.

You’ll notice that the blue Partition Data Section and its Logical Processor Data Sections are the first in their respective categories. I’ve always seen it to be the case that this LPAR’s sections come first. I assume PR/SM returns them in that sequence – though I don’t know if this is an architectural requirement.

The relationship between Partition Data Sections and the corresponding Logical Processor Data Sections is straightforward: Each Partition Data Section points to the first Logical Processor Data Section for that LPAR and has a count of the number of such sections. The pointer here is an index into the set of Logical Processor Data Sections, where the first has an index of 0. (ERBSHOW calls it “#1”.)

(A deactivated LPAR has an (irrelevant) index and a count of 0 – and that’s how my code detects them.)
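A minimal sketch of how that index-and-count scheme plays out, assuming the sections have already been parsed into Python structures. The field names are illustrative, not the real SMF 70-1 mapping names.

# Resolving each Partition Data Section to its Logical Processor Data Sections.
# Field names are illustrative; the real SMF 70-1 mappings differ.

partition_sections = [
    {"name": "PROD1", "lcpu_index": 0, "lcpu_count": 3},   # this LPAR – its sections come first
    {"name": "TEST1", "lcpu_index": 3, "lcpu_count": 2},
    {"name": "OLD1",  "lcpu_index": 0, "lcpu_count": 0},   # deactivated: count of 0
]
logical_processor_sections = [{"online_time": t} for t in (900, 900, 0, 900, 450)]

for partition in partition_sections:
    if partition["lcpu_count"] == 0:
        continue                      # deactivated LPAR – skip it, as our code does
    start = partition["lcpu_index"]   # index of its first Logical Processor Data Section
    sections = logical_processor_sections[start:start + partition["lcpu_count"]]
    print(partition["name"], "has", len(sections), "logical processors")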

So far so good, and quite complex.

How Do You Get Multiple 70-1 Records In An Interval?

Obviously each system cuts at least one record per interval – if 70-1 is enabled. So this is not about that.

In recent years the number of physical processors in a machine and logical processors per LPAR have both increased. I regard these as technological trends, driven mainly by capacity. At the same time there is an architectural trend towards more LPARs per machine.

Here are the sizes of the relevant sections – as of z/OS 2.4:

  • CPU Data Section: 92 bytes.
  • Partition Data Section: 80 bytes.
  • Logical Processor Data Section: 88 bytes.
  • Logical Core Data Section: 16 bytes.

These might not seem like large numbers but you can probably see where this is heading.

An SMF record can be up to about 32KB in size. You can only fit a few hundred Logical Processor Data Sections into 32KB, and that number might be significantly truncated if this LPAR has a lot of processors.

All of this was easy with machines with few logical processors (and still is).

But let’s take the case of a 100-way LPAR (whatever we think of that.) Its own sections are (92 + 88 + 16) x 100 or 19.6KB plus some other sections. So at least 20KB. And that’s before we consider sections for other LPARs.

Now let’s ignore this LPAR and consider the case of 50 1-way LPARs. There the PR/SM related sections add up to (80 + 88) x 50 = 8.4KB. Of course it’s extremely unlikely many would be 1-way LPARs, so the numbers are realistically much higher than that.

By the way, for a logical processor to count in any of this it just has to be defined. It might well have zero Online Time. It might well be a Parked Vertical Low. It doesn’t matter. The Logical Processor Data Sections are still there.

So, to exceed the capacity of a 32KB 70-1 SMF record we just have to have a lot of logical processors across all the LPARs in the machine, whether in this system or other LPARs. And an exacerbating factor is if these logical processors are across lots of LPARs.
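
Here is the same arithmetic as a rough sketch. It counts only the four section types above and ignores the record header and the other section types, so treat it purely as a back-of-the-envelope estimate.

# Rough estimate of the PR/SM-related section bytes in a 70-1 record (z/OS 2.4 sizes).
CPU_DATA = 92
PARTITION_DATA = 80
LOGICAL_PROCESSOR_DATA = 88
LOGICAL_CORE_DATA = 16
RECORD_LIMIT = 32 * 1024   # approximate SMF record ceiling

def section_bytes(this_lpar_cpus, other_lpar_cpus):
    # this_lpar_cpus: logical processors on the cutting LPAR.
    # other_lpar_cpus: list of logical processor counts for the other LPARs.
    this_lpar = PARTITION_DATA + this_lpar_cpus * (
        CPU_DATA + LOGICAL_PROCESSOR_DATA + LOGICAL_CORE_DATA)
    others = sum(PARTITION_DATA + n * LOGICAL_PROCESSOR_DATA for n in other_lpar_cpus)
    return this_lpar + others

# The 100-way LPAR plus 50 1-way LPARs from the text.
total = section_bytes(100, [1] * 50)
print(total, "bytes –", "fits in one record" if total < RECORD_LIMIT else "needs multiple records")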

What Does RMF Do If The Data Won’t Fit One Record?

I’ve seen a lot of SMF 70-1 records in my time, and spent a lot of time with ERBSCAN and ERBSHOW examining them at the byte (and sometimes bit) level.

I do know RMF takes great care to adjust how it lays out the records.

Firstly, to state the obvious, RMF doesn’t throw away data; All the sections exist in some record in the sequence.

Secondly, RMF keeps each LPAR’s sections together. So the Partition Data Section and its related Logical Processor Data Sections are all in the same record. This is obviously the right thing to do, otherwise the index and count for the Logical Processor Data Sections could break.

Thirdly, and this is something I hadn’t figured out before, only one record in the sequence contains the CPU Data Sections. (I think also the Logical Core Data Sections.)

How Should I Handle The Multi-Record Case?

Let me assume you’re actually going to have to decide how to deal with this.

There are two basic strategies:

  1. Assemble the records in the sequence into one record in memory.
  2. Handle each record separately.

Our code, rightly in my opinion, uses Strategy 2. Strategy 1 has some issues:

  • Collecting information from multiple records, and timing the processing of the composite data.
  • Fixing up things like the index of the Logical Processor Data Sections.

Probably some tools do this, but it’s fiddly.

So we process each LPAR separately, thanks to all the information being in one record. And so we can process each record separately.

Reality Check

If you have only one 70-1 record per interval per cutting system none of the above is necessary to know. But I think it’s interesting.

If you rely on some tooling to process the records – and most sensible people do – you probably don’t care about their structure. Certainly, the RMF Postprocessor gets this right for you in the CPU Activity Report (and Partition Data Report sub-report).

So, I’ve probably lost most of my audience at this point. 🙂 If not, you’re on my wavelength – which isn’t crowded. 🙂 (This is the second “on my wavelength” joke in my arsenal, the other being open to misinterpretation.)

I like to get down into the physical records for a number of reasons, not least of which are:

  • When things break I need to fix them.
  • It cements my understanding of how what they describe works.

Oh, and it’s fun, too.

Final Thoughts

This post was inspired by a situation that required yet more adjusting of our code. Sometimes life’s that way. In particular, a number of LPARs were missing – because our Assembler code threw away any record with no CPU Data Sections. (This is inherited code but it’s quite possibly a problem I introduced some time in the past 20 years.)

I should point out that – for simplicity – I’ve ignored IFLs, (now very rare) zAAPs, and ICFs. They are treated exactly the same as GCPs and zIIPs. Of course the record-cutting LPAR won’t have IFLs or ICFs.

I have a quite old presentation “Much Ado About CPU”. Maybe I should write one with “Part Two” tacked on. Or maybe “Renewed” – if it’s not such a radical departure. But then I’ve done quite a bit of presentation writing on the general topic of CPU over recent years.

Raspberry Pi As An Automation Platform

(First posted 14 February, 2021)

Recently I bought a touch screen and attached it to one of my older Raspberry Pis (a 3B). In fact the Pi is attached to the back of the touch screen and has some very short cables. This is only a 7 inch diagonal screen but it’s more than enough for the experiment I’m going to describe.

Some of you will recognise I’ve used similar things – Stream Decks and Metagrid – in the past. Most recently I showed a Metagrid screenshot in Automating Microsoft Excel Some More.

So it will be no surprise that I’m experimenting with push-button automation again. But this time I’m experimenting with something that is a little more outside the Apple ecosystem. (Yes, Stream Deck can be used with Windows PCs but that isn’t how I use it.)

While both Stream Deck and Metagrid are commercially available push-button automation products, I wanted to see what I could do on my own. I got far enough that I think I have something worth sharing with other people who are into automation.

What I Built

The following isn’t going to be the prettiest user interface in the world but it certainly gets the job done:

Here is what the buttons in the code sample do (for me):

  • The top row of buttons allows me to turn the landing lights on and off.
  • The middle row does the same but for my home office.
  • The bottom row has two dissimilar functions: kicking off a Keyboard Maestro macro, and rebooting the Pi.

This is quite a diverse set of functions and I want to show you how they were built.

By the way, the screen grab was done with the PrtSc (“Print Screen”) key and transferred to my iPad using Secure Shellfish.

I used this article to figure out how to auto start the Python code when the Pi boots. It doesn’t get me into “kiosk mode” but then I didn’t really want it to.

Python Tkinter User Interface

What you see on the screen is a very simple Python program using the Tkinter graphical user interface library.

The following is the code I wrote. If you just copy and paste it, it won’t run. There are two modifications you’d need to make:

  • You need to supply your IFTTT maker key – enclosed in quotes.
  • You need to supply the URL to your Keyboard Maestro macro – for each macro.

If you don’t have IFTTT you could set IFTTTbuttonSpecs to an empty list. Similarly, if you don’t have any externally callable Keyboard Maestro macros (or externally callable URLs) you would want to make URLButtonSpecs an empty list.

You can, of course, rearrange buttons by changing their row and column numbers.

#!/usr/bin/env python3
import tkinter as tk
import tkinter.font as tkf
from tkinter import messagebox
import urllib.request
import urllib.parse
import os


class Application(tk.Frame):
    def __init__(self, master=None):
        tk.Frame.__init__(self, master)
        self.grid()
        self.createWidgets()
        self.IFTTTkey = <Insert your IFTTT Key Here>

    def createWidgets(self):
        self.bigFont = tkf.Font(family="Helvetica", size=32)
        IFTTTbuttonSpecs = [
            ("Landing", True, "Landing\nLight On",0,0),
            ("Landing", False, "Landing\nLight Off",0,1),
            ("Office", True, "Office\nLight On",1,0),
            ("Office", False, "Office\nLight Off",1,1),
        ]

        URLButtonSpecs = [
            ("Say Hello\nKM", <Insert your Keyboard Maestro macro's URL here>,2,0)
        ]

        localCommandButtonSpecs = [
            ("Reboot\nPi","sudo reboot",2,1),
        ]

        buttons = []

        # IFTTT Buttons
        for (lightName, lightState, buttonLabel, buttonRow, buttonColumn) in IFTTTbuttonSpecs:
            # Create a button
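            # Bind this row's values as lambda default arguments below, so each
            # button keeps its own lightName / lightState rather than the loop's last values.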
            button = tk.Button(
                self,
                text=buttonLabel,
                command=lambda lightName1=lightName, lightState1=lightState: self.light(
                    lightName1, lightState1
                ),
                font=self.bigFont,
            )

            button.grid(row=buttonRow, column=buttonColumn)

            buttons.append(button)

        for (buttonLabel, url, buttonRow, buttonColumn) in URLButtonSpecs:
            # Create a button
            button = tk.Button(
                self,
                text=buttonLabel,
                command = lambda url1 = url : self.doURL(
                    url1
                ),
                font=self.bigFont,
            )

            button.grid(row=buttonRow, column=buttonColumn)

            buttons.append(button)

        for (buttonLabel, cmd, buttonRow, buttonColumn) in localCommandButtonSpecs:
            # Create a button
            button = tk.Button(
                self,
                text=buttonLabel,
                command = lambda cmd1 = cmd : self.doLocalCommand(
                    cmd1
                ),
                font=self.bigFont,
            )

            button.grid(row=buttonRow, column=buttonColumn)

            buttons.append(button)

    def light(self, room, on):
        if on:
            url = (
                "https://maker.ifttt.com/trigger/"
                + urllib.parse.quote("Turn " + room + " Light On")
                + "/with/key/"
                + self.IFTTTkey
            )
        else:
            url = (
                "https://maker.ifttt.com/trigger/"
                + urllib.parse.quote("Turn " + room + " Light Off")
                + "/with/key/"
                + self.IFTTTkey
            )
        opening = urllib.request.urlopen(url)
        data = opening.read()

    def doLocalCommand(self, cmd):
        os.system(cmd)

    def doURL(self, url):
        opening = urllib.request.urlopen(url)
        data = opening.read()

app = Application()
app.master.title("Control Panel")
app.mainloop()

I’ve structured the above code to be extensible. You could easily change any of the three types of action, or indeed add your own.

Hue Light Bulbs And IFTTT

Philips Hue light bulbs are smart bulbs that you can turn on and off with automation. There are others, too, but these are the ones I happen to have in the house, along with a hub. I usually control them with Siri to one of the HomePods in the house or Alexa on various Amazon Echo / Show devices.

IFTTT is a web-based automation system. You create applets with two components:

  1. A trigger.
  2. An action.

When the trigger is fired the action happens. In my experiment a webhook URL can be set up to trigger the Hue Bulb action. For each of the four buttons I have an applet. Two bulbs x on and off.

I would observe a number of things I don’t much like, though none of them stopped me for long:

  • The latency is a few seconds – but then I usually don’t need a light to come on or go off quicker than that.
  • You can’t parameterise the applet to the extent I would like, more or less forcing me to create one applet per button.
  • You can’t clone an IFTTT applet. So you have to create them by hand.

Still, as I said, it works well enough for me. And I will be keeping these buttons.

Remotely Invoking Keyboard Maestro

This one is a little more sketchy, but only in terms of what I’ll do with it. You’ll notice I have “Hello World”. The sorts of things I might get it to do are:

  • Opening all the apps I need to write a blog post. Or to edit a certain presentation.
  • Rearranging the windows on my screen.

Keyboard Maestro is incredibly flexible in what it allows you to do.

To be able to call a macro you need to know two things:

  1. Its UUID.
  2. The bonjour name (or IP address) of the Mac running Keyboard Maestro.

You also need to have enabled the Web server in the Web Server tab of Keyboard Maestro’s Preferences dialog.

To construct the URL you need to assemble the pieces something like this:

http://<server name>:4490/action.html?macro=<macro UUID>

The UUID can be obtained while editing the macro using menu item “Copy UUID” under “Copy As” from the “Edit” menu.

It’s a little complicated but it runs quickly and can do a lot in terms of controlling a Mac.

Rebooting The Raspberry Pi

This one is the simplest of all – and the quickest. Python has the os.system() function. You pass a command string to it and it executes the command. In my case the command was sudo reboot.

It’s not surprising this is quick to kick off – as this is a local command invocation.

After I copied the Python code into this blog post I decided I wanted a companion button to shut down the Pi – for cleaning purposes. This would be trivial to add.

Conclusion

This is quite a good use of a semi-redundant Raspberry Pi – even if I spent more on the touch screen than I did on the Pi in the first place. And it was ever thus. 🙂

The diversity of the functions is deliberate. I’m sure many people can think of other types of things to kick off from a push button interface on a Raspberry Pi with a touch screen. Have at it! Specifically, feel free to take the Python code and improve on it – and tell me how I’m doing it all wrong. 🙂 Have fun!

I, for one, intend to keep experimenting with this. And somebody makes a 15” touch screen for the Pi… 🙂

Coupling Facility Structure Performance – A Multi-System View

It’s been quite a while since I last wrote about Coupling Facility performance. Indeed it’s a long time since I presented on it – so I might have to update my Parallel Sysplex Performance presentation soon.

(For reference, that last post on CF Performance was Maskerade in early 2018.)

In the past I’ve talked about how a single system’s service time to a single structure behaves with increasing load. This graphing has been pretty useful. Here’s an example.

This is from a system we’ll call SYS1. It is ICA-SR connected. This means a real cable, over less than 150m distance. It’s to a single structure in Coupling Facility CF – DFHXQLS_POOLM02, which is a list structure. Actually a CICS Temporary Storage sharing pool – “POOLM02”.

From this graph we can see that the service time for a request stays pretty constant at around 7.5μs. Also that the Coupling Facility CPU time per request is almost all of it.

I have another stock graph, actually a pair of them, which show a shift average view of all the systems’ performance with a single structure. This is pretty nice, too.

Here’s the Rate Graph across the entire sysplex.

Here we see SYS1 and its counterparts in the Sysplex – SYS2, SYS3, and SYS4.

(Note to self: They really are numbered that way.)

We can see that in general the traffic is mostly from SYS1 and SYS2, and almost none from SYS3. I would call that architecturally significant.

We can also see that there is no asynchronous traffic to this structure from any LPAR.

And here’s the Service Time graph.

You can see that the two IC-Peer-connected LPARs have better service times than the two ICA-SR-connected LPARs. This is reasonable given that IC Peer links are simulated by PR/SM and so are unaffected by the speed of light or distance. Again, the statement has to be qualified with “in general”.

But the graphs you’ve seen so far leave a lot of questions unanswered.

So, for a long time I’ve wanted to do something that combined the two approaches: Performance With Increasing Load, and Differences Between Systems.

I wanted to get beyond the single-system view of scalability. I usually put a number of systems’ scalability graphs on a single slide but

  • The graphs end up smaller than I would like.
  • This doesn’t scale beyond four systems.

The static multi-system graphing is fine but it really doesn’t tell the full story.

Well, now I have it in my kitbag. I’m sharing a new approach with you – because I think you’ll find it interesting and useful.

The New Approach

How about plotting all the systems’ service times versus rates on one graph? It sounds obvious – now I mention it.
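
Here’s a minimal sketch of that kind of graph, assuming you have already extracted per-interval request rates and service times for each system. The numbers are made up purely to show the shape of the plotting code.

# One scatter series per system: synchronous service time versus request rate (made-up data).
import matplotlib.pyplot as plt

systems = {
    "SYS1": ([1000, 5000, 9000], [7.4, 7.5, 7.6]),
    "SYS2": ([2000, 6000, 10000], [5.1, 5.3, 5.6]),
}

for name, (rate, service_time) in systems.items():
    plt.scatter(rate, service_time, label=name)

plt.xlabel("Requests per second")
plt.ylabel("Synchronous service time (microseconds)")
plt.title("DFHXQLS_POOLM02 – service time versus rate by system")
plt.legend()
plt.show()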

Well, let’s see how it works out. Here’s a nice example:

Again we have the same four systems and the same CF structure. Here’s what I conclude when I look at this:

  • SYS2 and SYS4 have consistently better service times – across the entire operating range – than SYS1 and SYS3. This shows the same IC Peer vs ICA-SR dynamic as we saw before.
  • SYS3 service times are worse than those of the other 3 – and again we see its rate top out considerably lower than those of the other 3.
  • SYS2 service times are always worse than SYS4’s. They happen to share the same machine and SYS2 is a much bigger LPAR than SYS4, actually spanning more than 1 drawer. That might have something to do with it.

Conclusion

Coupling Facility service times and traffic remain key aspects of tuning Parallel Sysplex implementations. The approach of “understand what happens with load” also remains valid.

The new piece – combining the service times for all LPARs sharing a structure on one graph – looks like the best way of summarising such behaviours so far.

Of course this graph will evolve. I can already think of two things to do to it:

  • Add the link types into the series legend.
  • Avoid showing systems that don’t have any traffic to the structure (and maybe indicating that in the title).

But, for now, I want to get more experience with using this graph. For example, an even more recent customer has all systems connected to each coupling facility by ICA-SR links. The graphs for that one show similar curves for each system – which is unsurprising. But maybe in that case I would see a difference if the links were of different lengths.

And, as always, if I learn something interesting I’ll let you know.

More On Samples

This post follows on from A Note On Velocity from 2015. Follows on at a respectful distance, I’d say – since it’s been 5 years.

In that post I wrote “But those ideas are for another day or, more likely, another year (it being December now).” This is that other day / year – as this post reports on some of those “left on the table” aspects. For one, I do now project what happens if we include (or exclude) I/O samples.

In a recent customer engagement I did some work on WLM samples for a Batch service. This service class has 2 periods, the first period having an incredibly short 75 service units duration.

  • Period 1 is Importance 4, with a reasonable velocity.
  • Period 2 is Discretionary.

Almost everything ends up in Period 2 – so almost all batch work in this shop is running Discretionary, i.e. bottom dog, without a goal.

As I said in A Note On Velocity, RMF reports attained velocity from Using and Delay samples and these come direct from WLM. Importantly this means you can calculate Velocity without having to sum all the buckets of Using and Delay samples. You won’t, for example, add in I/O Using and I/O Delay samples when you shouldn’t – if you’re calculating velocity from the raw RMF SMF fields (as our code does). I’ll call this calculation using the overall Using and Delay buckets the Headline Velocity Calculation.
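
A minimal sketch of the calculation, with made-up sample counts. The bucket names are descriptive rather than the raw RMF field names, and which buckets you include is exactly the point at issue in the rest of this post.

# Velocity from Using and Delay sample buckets – bucket names are descriptive, not raw RMF fields.
def velocity(using, delay):
    # WLM execution velocity: 100 * Using / (Using + Delay).
    return 100.0 * using / (using + delay)

# Made-up sample counts for one service class period.
cpu_using, cpu_delay = 4000, 6000
io_using, io_delay = 3000, 2000
qmpl_delay = 500   # initiator delay, relevant for WLM-managed batch

# Excluding I/O samples, but counting QMPL as a delay.
print(velocity(cpu_using, cpu_delay + qmpl_delay))

# What it looks like if I/O samples are included (as with I/O Priority Management).
print(velocity(cpu_using + io_using, cpu_delay + io_delay + qmpl_delay))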

I thought this would be useful for figuring out if I/O Priority Management is enabled. In fact there’s a flag for that – at the system level – but if you do the calculation by totting up the buckets you get sensible numbers for both cases: Enabled and Disabled.

I/O Priority Management can be enabled or disabled at the service class level. I don’t definitively see a flag in RMF for this at the service class level but presumably if the headline calculation doesn’t work versus totting up the individual buckets with I/O samples then the Service Class is not subject to I/O Priority Management. And the converse would be true.

Batch Samples

For Batch, the headline calculation is matched by totting up the buckets for Using and Delay, if you include QMPL in the Delay samples tally – because this represents Initiator Delay. This is sensible to include in the velocity calculation as WLM-managed initiators are, as the name suggests, managed according to goal attainment and a delay in being initiated really ought to be part of the calculation.

Equally, though, with JES-managed initiators you could get a delay waiting for an initiator. And WLM isn’t going to do anything about that.

(By the way, SMF 30 – at the address space / job level – has explicit times fields for a job getting started. The most relevant one is SMF30SQT.)

I was reminded in this study that samples where the work is eligible to run on a zIIP but where it actually runs on a GCP are included in Using GCP samples. If you do the maths it works. It’s not really surprising.

This is also a good time to remind you samples aren’t time, except for CPU – which is measured and converted to samples.

An example of where this is relevant is when zIIP speed is different from GCP speed. There are two cases for this:

  • With subcapacity GCPs – where the zIIPs are faster than GCPs.
  • With zIIPs running SMT-2 – where zIIP speed is slower than when SMT is not enabled. (It might still be faster than a GCP but it might not be.)

Here, it becomes interesting to think about how you get all the sample types approximately equivalent. I would expect – in the “zIIPs are a different speed from GCPs” case – there might need to be some use of the (R723NFFI) conversion factor. I wouldn’t, though, expect the effective speed of SMT-2 zIIPs to be part of the conversion.

But perhaps I’m overthinking this and perhaps a raw zIIP second is treated the same as a raw GCP second. And both are, of course, different to Using I/O.

Sample Frequency And Sampleable Units

WLM samples Performance Blocks (PBs). These might be 1 per address space or there might be many. CICS regions would be an example of where there are many.

I’m told PBs in a CICS region are not the same as MXT (maximum number of tasks) but could approach it if the workload in the region built up enough. This is different from what I thought.

I tried to calculate MXT from sample counts divided by the sampling interval and didn’t get a sensible estimate. Which is why I asked a few friends. You can imagine that a method of calculating MXT not requiring CICS-specific instrumentation would’ve been valuable.

Conclusion

One thing I should note in this post is that – in my experience – sampling is exact. That is to say, if you add up the samples in the buckets right you get exactly the headline number. Exactness is valuable in that it gives you confidence in your inferences. Inexactness could still leave you wondering.

Most people don’t get into the raw SMF fields but if you do:

  • You can go beyond what, e.g., RMF reports give you.
  • You get a much better feel for how the data (and the reality it describes) actually works.

But, as with the CICS MXT case, you can get unexpected results. I hope you (and I) learn from those.

Automating Microsoft Excel Some More

As I said in Automating Microsoft Excel, I thought I might write some more about automating Excel.

Recall I wrote about it because finding snippets of code to do what you want is difficult. So if I can add to that meagre stockpile on the web, I’m going to.

That other post was about automating graph manipulation. This post is about another aspect of automating Excel.

The Problem I Wanted To Solve

Recently I’ve had several instances where I’ve created a CSV (Comma-Separated Value) file I wanted to import into Excel. That bit’s easy. What made these instances different (and harder) was that I wanted to import them into a bunch of sheets. Think “15 sheets”.

This is a difficult problem because you have to:

  1. Figure out where the break points are. I’m thinking a row with only a single cell as a good start. (I can make my CSV file look like that.)
  2. Load each chunk into a separate new sheet.
  3. Name that sheet according to the value in that single cell.
  4. (Probably) delete any blank rows, or any that are just a cell with (underlining) “=” or “-” values.

I haven’t solved that problem. When I do I’ll be really happy. I expect to in 2021.

The Problem I Actually Solved

Suppose you have 15 sheets. There are two things I want to do, given that:

  • Rapidly move to the first or last sheet.
  • Move the current sheet left or right or to the start or end.

The first is about navigation when the sheets are in good shape. The second is about getting them that way. (When I manually split a large CSV file the resulting sheets tend not to be in the sequence I want them in.)

As noted in the previous post I’m using the Metagrid app on a USB-attached iPad. Here is what my Metagrid page for Excel currently looks like:

In the blue box are the buttons that kick off the AppleScript scripts in this post. As an aside, note how much space there is around the buttons. One thing I like about Metagrid is you can spread out and not cram everything into a small number of spots.

The Scripts

I’m not going to claim my AppleScript is necessarily the best in the world – but it gets the job done. Unfortunately that’s what AppleScript is like – but if you are able to improve on these I’m all ears (well, eyes).

Move To First Sheet

tell application "Microsoft Excel"
	select worksheet 1 of active workbook
end tell

Move To Last Sheet

tell application "Microsoft Excel"
	select worksheet (entry index of last sheet) of active workbook
end tell

Move Sheet To Start

tell application "Microsoft Excel"
	set mySheet to active sheet
	move mySheet to before sheet 1
end tell

Move Sheet To End

tell application "Microsoft Excel"
	set mySheet to active sheet
	set lastSheet to (entry index of last sheet)
	move mySheet to after sheet lastSheet
end tell

Move Sheet Left

tell application "Microsoft Excel"
	set mySheet to active sheet
	set previousSheet to (entry index of active sheet) - 1
	-- Note: assumes the active sheet isn't already the first one; if it is, sheet 0 doesn't exist and this errors
	move mySheet to before sheet previousSheet
end tell

Move Sheet Right

tell application "Microsoft Excel"
	set mySheet to active sheet
	set nextSheet to (entry index of active sheet) + 1
	-- Note: assumes the active sheet isn't already the last one; if it is, this errors
	move mySheet to after sheet nextSheet
end tell

Conclusion

Those snippets of AppleScript look pretty simple. However, each took quite a while to get right. But now they save me time on a frequent basis. And they might save you time.

They are all Mac-based but the model is similar to that in VBA. If you’re a Windows person you can probably replicate them quite readily with VBA.

And perhaps I will get that “all singing, all dancing” Import-A-CSV-Into-Multiple-Sheets automation working. If I do you’ll read about it here.

Mainframe Performance Topics Podcast Episode 27 “And Another Thing”


So this is one of our longest episodes yet, but it’s jam-packed with content that is very us. As usual, finding times when we could both be available to record was tough. What wasn’t difficult was finding material. I can say for myself that what we talked about is a set of things I’ve wanted to talk about for a long time.

Anyhow, enjoy! And do keep feedback and “Ask MPT” questions coming.


Episode 27 “And Another Thing” Show Notes – The Full, Unexpurgated, Version


Follow up


  • Additional System Recovery Boost enhancements:

    • Sysplex partitioning recovery: Boosts all surviving systems in the sysplex as they recover and take on additional workload following the planned or unplanned removal of a system from the sysplex.

    • CF structure recovery: Boosts all systems participating in CF structure recovery processing, including rebuild, duplexing failover, and reduplexing.

    • CF data-sharing member recovery: Boosts all systems participating in recovery following the termination of a CF locking data-sharing member.

    • HyperSwap recovery: Boosts all systems participating in a HyperSwap recovery process to recover from the failure of a storage subsystem.

    • Existing functions: image-related boosts for IPL (60 min) and shutdown (30 min).

    • These are different in that they boost multiple systems, rather than the single one that the originally announced SRB would boost. These should be fairly rare occurrences – but really helpful when needed.

  • Two more portable software instances: Db2 and IMS added to Shopz on Aug 20, 2020, in addition to CICS (Dec 2019).


ASK MPT


Martin was asked about which systems in a System Recovery Boost Speed Boost situation get their GCPs sped up to full speed. The answer is it’s only the LPARs participating in a boost period that get their GCPs sped up. For example, on a 6xx model the other LPARs don’t get sped up to 7xx speed.


Mainframe – Trying out Ansible for driving z/OSMF Workflows


  • Marna’s been learning about Ansible and how it can drive z/OSMF work – so far only with z/OSMF Workflows, so the terminology might not be exactly right, given her inexperience.

  • Found a driving system for Ansible (an Ubuntu Linux distribution running on x86).

  • A lot of installs on this Linux distribution were necessary: Python 3.7, Ansible, and the Ansible Galaxy collection ibm.ibm_zos_zosmf.

  • Ansible Galaxy is a public collection of modules. You run those modules from a playbook, whose steps are called roles – like jobs with steps. Ansible has some sophistication where those roles can be run by different people.

  • Had to learn a little YAML, and understand the sample playbook which came with the collection, changing the playbook for her specific system.

    • This is where help was needed from two gracious and helpful Poughkeepsie system testers (massive thanks to Marcos Barbieri and Daniel Gisolfi!)

    • Learning about Ansible with a staging inventory (a configuration file), and learning which playbook messages were OK and which were not.

  • Encountered two kinds of problems:

    • Security issues connecting the Linux environment to a large Poughkeepsie System Test z/OSMF environment

      • Required changes to the playbook, and to environment files.
    • Workflow-related issues: duplicate workflow instances when part of the playbook ran OK, and selecting an automated workflow.

  • Why learn Ansible when you are a z/OS person? Ansible gives us a very nice interface to get to z/OS resources via z/OSMF capabilities with a simple programming interface.

    • Also, if we want to get more people to use z/OS, and they are familiar with Linux, they will probably want to drive work to z/OS with some sort of familiar automation.

    • Installing products on Linux from a command line, and having to keep looking up command syntax, isn’t that fun – although the syntax is pretty easy to find with Google.

    • The products all installed quickly and cleanly; however, knowing that the dependencies were right was not obvious – especially the Python and Ansible levels.

  • Ansible Tower as the GUI is helpful, but Marna chose to learn Ansible in pieces from the ground up.

  • As it seems to always be: getting your job done comes down to what your search returns, no matter what you are doing – z/OS or Linux work. Or even programming.


Performance – So You Don’t Think You’re An Architect?


  • Brand new presentation for 2020, builds on the “who says what SMF is for?” slide in “How To Be A Better Performance Specialist”

    • Usually kick off an engagement with an Architecture discussion, which might be hours long!

    • Builds common understanding and trust

    • Techniques in this presentation well honed

  • Presentation talks about 3 layers

    • The Top Layer – Machines, Sysplexes And LPARs

      • For example a diagram showing all the LPARs and their definitions – by pool – for a machine. Together with their sysplexes.

        • Important to answer “what is the role of this LPAR?”

        • Important to understand how far apart machines are

        • Inactive LPARs often give availability clues

        • Driven by RMF CPU and Sysplex SMF

    • The Middle Layer – Workloads And Middleware

      • Spikes: e.g. HSM Daily spike, and actually patterns of utilisation in general

      • Restarts: IPLs, Db2, IMS, CICS Regions

      • Cloning: Things like identical CICS regions

      • Topology: e.g. for CICS what Db2 or MQ a region connects to, from SMF 30 Usage Data Section. Invert that and you have the Db2’s role and what connects to this Db2. Enclave statistics give DDF traffic, Db2 Datasharing group structure

      • Software Levels: From SMF 30 for CICS, Db2, MQ in the Usage Data Section. Often see IMS in transition, and there are limitations with Db2 12 onwards – Function Levels, e.g. M507 vs M500.

        • All this from SMF 30 and 74-2 XCF, run through Martin’s own tooling. The point of mentioning it is to get the ideas out.

        • MXG discussion ongoing – especially about SMF 30 Usage Data Section. Believe the issues have been ironed out.

    • The Bottom Layer – Application Components

      • Things like Db2 and MQ Accounting Trace: Give names in great detail, likewise CICS Monitor Trace (these are very heavy but very valuable)

      • DDF topology can be handy

      • Data sets – from SMF 42-6, 14, 15, 16, 62, 64

    • Fulfilling the definition of Architecture?

      • Understand what’s really going on

      • In a way that’s especially valuable to real systems architects

  • Customers should consider this approach because you can keep an up to date view of the estate and inform your own capacity and performance role.

  • Martin will keep exploring this rich avenue because dabbling in architecture keeps him interested, and lets him find patterns among customers.


Topics – Notifications


  • They are unsolicited messages appearing on the device, generally interruptions whether wanted or not.

  • Can contain text and images, so they’re really just messages. A platform-neutral view follows.

  • Where do they come from?

    • From all sorts of services, many on the web

    • Need polling applications on the device

    • Can drain the battery

  • What do you receive and where?

    • Some on Watch: For example, from my bank immediately after a contactless payment, and from a washing machine.

    • Most on phone: Slack notifications, and when the Google Home wifi is out. Sometimes inappropriate, like at Carnegie Hall.

    • A few on iPad: For example WordPress notifications

    • Very few on Mac

    • Very few on web browser: Only when someone replies on favourite newsgroups

    • A key way of managing them is trying not to duplicate across multiple devices

    • IFTTT still a favorite of Marna’s, mentioned in Episode 23 “The Preview That We Do” Topics topic. Martin just subscribed to Premium, but at a low-ball price.

    • Constant interruptions not always welcome, especially those with sounds.

  • How do you manage them?

    • Try to reduce their number. Often there’s a “wood for the trees” problem.

    • On iOS at least apps can stack/group them

    • Many are just plain spammy

      • Many are defaulted to on when you install an app
    • Figure out which devices should get what and watch for “must always receive immediately”

    • Only allow them in the places they make sense, such as exercise chivvies, letting each app have its own settings

  • An advanced topic is how you can generate your own

    • IFTTT, from the service (emails, messages) or from the app (weather tomorrow).

    • On iOS the Pushcut app: a webhook from eg IFTTT can cause a notification

      • Pushcut webhook support includes the ability to trigger actions on the iOS device, using cURL to invoke webhooks (see the sketch after this list)
    • On Android we have MacroDroid.

    • If you have an Amazon Echo or related device you can use web hooks with the Notify Me skill to send notifications to the Echo

  • Lots of scope for things to go wrong with automation based on notifications
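To give a flavour of what “invoke a webhook” looks like, here is a minimal Python sketch using only the standard library. The URL and the payload field names are made up for illustration – the real ones come from whichever service (Pushcut, IFTTT, Notify Me) you are using.

import json
import urllib.request

# Hypothetical webhook endpoint - the real URL comes from the service's app or docs.
WEBHOOK_URL = "https://example.com/my-notification-webhook"

def send_notification(title: str, text: str) -> int:
    # POST a small JSON payload to the webhook endpoint.
    # The "title"/"text" keys are assumptions; services differ in what they expect.
    body = json.dumps({"title": title, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        WEBHOOK_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

print(send_notification("Washing done", "The machine has finished its cycle."))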


Customer requirements


  • Customer Requirement 138729 : z/OS Console message help support JES2 messages

    • Use case: Hover over a JES2 message or other prefixed message to get the message help.

    • IBM response: This RFE has been delivered via APAR PH24072 at the end of June 2020. It’s available on both V2R3 and V2R4. The JES2 message prefix, such as $, can be set up in the “Configure Message Help” dialog in the z/OSMF Console UI.

    • In “Configure Message Help” there is a location for your KC4Z hostname, so you can retrieve message help from the Knowledge Center when you hover over a message ID.

    • Note that it might in principle be usable by other subsystems, but it was raised as a z/OSMF requirement for JES2.


On the blog


SAP And Db2 Correlation IDs

Every so often I get to work with a SAP customer. I’m pleased to say I’ve worked with four in recent years. I work with them sufficiently infrequently that my DDF SMF 101 Analysis code has evolved somewhat in the meantime.

The customer situation I’m working with now is a good case in point. And so I want to share a few things from it. There is no “throwing anyone under the bus” but I think what I’ve learnt is interesting. I’m sure it’s not everything there is to learn about SAP, so I won’t pretend it is.

The Structure Of SAP Db2 Correlation IDs

In Db2 a correlation ID (or corrid) is a 12-character name. Decoding it takes some care. For example:

  • For a batch job up to the first 8 characters are the job name.
  • For a CICS transaction characters 5 to 8 are the transaction ID.

In this set of data the correlation ID is interesting and useful:

  • The first three characters are the Db2 Datasharing group name (or SAP application name).
  • The next three are “DIA” or “BTC” – usually. Occasionally we get something else in these 3 positions.
  • Characters 7 to 9 are a number – but encoded in EBCDIC so you can read them.

I wouldn’t say that all SAP implementations are like this, but there will be something similar about them – and that’s good enough. We can do useful work with this.
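To make that concrete, here is a minimal Python sketch of decoding a correlation ID with this structure. It assumes the corrid has already been translated from EBCDIC into a readable 12-character string; the names are made up for illustration and this isn’t my actual DDF SMF 101 Analysis code.

from typing import NamedTuple

class SapCorrid(NamedTuple):
    group: str     # Db2 Datasharing group / SAP application name (characters 1 to 3)
    workload: str  # usually "DIA" or "BTC" (characters 4 to 6)
    suffix: str    # numeric process suffix (characters 7 to 9)

def parse_sap_corrid(corrid: str) -> SapCorrid:
    # Decode a 12-character SAP-style Db2 correlation ID, assumed already
    # translated from EBCDIC, e.g. "XYZBTC083".
    corrid = corrid.ljust(12)
    return SapCorrid(group=corrid[0:3], workload=corrid[3:6], suffix=corrid[6:9])

# parse_sap_corrid("XYZBTC083") -> SapCorrid(group='XYZ', workload='BTC', suffix='083')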

Exploring – Using Correlation IDs

Gaul might indeed be divided into three parts. (“Gallia est omnis divisa in partes tres”). So let’s take the three parts of the SAP Correlation ID:

Db2 Datasharing Group Name / Application Name

To be honest, this one isn’t exciting – unless the Datasharing Group Name is different from the SAP Application Name. This is because:

  • Each SAP application has one and only one (or zero) Datasharing Groups.
  • Accounting Trace already contains the Datasharing Group Name.

In my DDF SMF 101 Analysis code I’m largely ignoring this part of the Correlation ID, therefore.

BTC Versus DIA

The vast majority of the records have “BTC” or “DIA” in them, and this post will ignore the very few others. Consider the words “have “BTC” or “DIA” in them”. I chose my words carefully: these strings might not be at offsets 3 to 5. Here’s a technique that makes that not matter.

I could use exact equality in DFSORT, meaning the match has to happen at a specific position. However, DFSORT also supports substring search.

Here is the syntax for an exact match condition:


INCLUDE COND=(CORRID46,EQ,C'BTC')

Here I’ve had to define an extra symbol (CORRID46) mapping positions 4 to 6 (offsets 3 to 5). That’s a symbol I don’t really want and it isn’t flexible enough.

Here’s how it would look using a substring search condition:


INCLUDE COND=(CORRID,SS,EQ,C'BTC')

This is much better as I don’t need an extra symbol definition and the string could be anywhere in the 12 bytes of the CORRID field.

If we can distinguish between Batch (“BTC”) and Dialog (“DIA”) we can do useful things. We can show commits and CPU by time of day – by Batch versus Dialog. We could do Time Of Day anyway, without this distinction. (My DDF SMF 101 Analysis code can go down to 100th of a second granularity – because that’s the SMF time stamp granularity – so I regularly summarise by time of day.) But this distinction allows us to see a Batch Window, or times when Batch is prevalent. If we are trying to understand the operating regime, such distinctions can be handy.
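Here is a minimal sketch of that kind of summary, assuming you have already pulled (timestamp, corrid, CPU seconds, commits) tuples out of the SMF 101 records by some other means. The hour-level bucketing is just for brevity; finer granularity works the same way.

from collections import defaultdict
from datetime import datetime

def summarise_by_hour(records):
    # records: iterable of (timestamp, corrid, cpu_seconds, commits) tuples.
    # Returns {(hour, workload): (total_cpu, total_commits)}, where workload is
    # the "BTC" or "DIA" string found anywhere in the corrid (others ignored) -
    # mirroring the substring-search approach above.
    totals = defaultdict(lambda: [0.0, 0])
    for timestamp, corrid, cpu, commits in records:
        for workload in ("BTC", "DIA"):
            if workload in corrid:
                totals[(timestamp.hour, workload)][0] += cpu
                totals[(timestamp.hour, workload)][1] += commits
                break
    return {key: tuple(value) for key, value in totals.items()}

# summarise_by_hour([(datetime(2021, 1, 1, 2, 30), "XYZBTC083", 0.015, 3)])
# -> {(2, 'BTC'): (0.015, 3)}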

Numeric Suffix

This is the tricky one. Let’s take an example: “XYZBTC083”

We’re talking about the “083” part. It looks like a batch job identifier within a suite. But it isn’t. For a start, such a naming convention would not survive in a busy shop. So what is it?

There are a few clues:

  • “XYZBTC083” occurs throughout the set of data, day and night. So it’s not a finite-runtime batch job.
  • In the (QWHS) Standard Header the Logical Unit Of Work ID fields for “XYZBTC083” change.
  • The “083” is one value in a contiguous range of suffixes.

What we really have here are SAP Application Server processes, each with their own threads. These threads appear to get renewed every so often. Work (somehow) runs in these processes and, when it goes to Db2, it uses these threads. It’s probably controllable when these threads get terminated and replaced – but I don’t see compelling evidence in the data for that control.

This “083” suffix is interesting: In one SAP application I see a range of “XYZDIA00” – “XYZDIA49”. Then I see “XYZBTC50” – “XYZBTC89”. So, in this example, that’s 50 Dialog processes and 40 Batch processes. So that’s some architectural information right there. What I don’t know is whether lowering the number of processes is an effective workload throttle, nor whether there are other controls in the SAP Application Server layer on threads into Db2. I do know – in other DDF applications – it’s better to queue in the middle tier (or client application) than queue too much in Db2.
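Counting those processes from the correlation IDs is straightforward – a minimal sketch, again assuming already-readable corrids and reusing the character positions described earlier:

from collections import defaultdict

def count_processes(corrids):
    # corrids: iterable of readable 12-character correlation IDs.
    # Returns {workload: number of distinct process suffixes},
    # e.g. {"DIA": 50, "BTC": 40} for the example above.
    suffixes = defaultdict(set)
    for corrid in corrids:
        workload, suffix = corrid[3:6], corrid[6:9]
        if workload in ("DIA", "BTC"):
            suffixes[workload].add(suffix)
    return {workload: len(values) for workload, values in suffixes.items()}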

IP Addresses And Client Software

Every SMF 101 record has an IP Address (or LU Name). In this case I see a consistent set of a small number of IP addresses. These I consider to be the Application Servers. I also see Linux on X86 64-Bit (“Linux/X8664”) as the client software. I also see its level.

So we’re building up a sense of the application landscape, albeit a rudimentary one. In this case, client machines. (Middle tier machines, often – if we take the more general DDF case rather than just SAP.)

Towards Applications With QMDAAPPL

When a client connects to Db2 via DDF it can pass certain identifying strings. One of these shows up in SMF 101 in a 20-byte field – QMDAAPPL.

SAP sets this string, so it’s possible to see quite a high degree of fine detail in what’s coming over the wire. It’s early days in my exploration of this – with my DDF SMF 101 Analysis code – but here are two things I’ve noticed, looking at two SAP applications:

  • Each application has a very few QMDAAPPL values that burn the bulk of the CPU.
  • Each application has a distinctly different (though probably not totally disjoint) set of QMDAAPPL values.

I’ve looked up a few of the names on the web. I’ve seen enough to convince me I could tell what the purpose of a given SAP application is, just from these names. Expect that as a future “stunt”. 🙂
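Here is a minimal sketch of the “which few values burn the bulk of the CPU” check, assuming you have already extracted (QMDAAPPL, CPU seconds) pairs from the records. It is an illustration, not my actual code.

from collections import Counter

def top_cpu_burners(pairs, share=0.9):
    # pairs: iterable of (qmdaappl, cpu_seconds).
    # Returns the smallest set of QMDAAPPL values that together account for
    # at least `share` of the total CPU, largest consumers first.
    totals = Counter()
    for name, cpu in pairs:
        totals[name] += cpu
    grand_total = sum(totals.values())
    picked, running = [], 0.0
    for name, cpu in totals.most_common():
        picked.append((name, cpu))
        running += cpu
        if grand_total and running / grand_total >= share:
            break
    return picked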

Conclusion

I think I’ve shown you can do useful work – with Db2 Accounting Trace (SMF 101) – in understanding SAP accessing Db2 via DDF.

SAP is different from many other types of DDF work – and you’ve seen evidence of that in this long post.

One final point: SAP work comes in short commits / transactions – which makes it especially difficult for WLM to manage. In this set of data, for instance, there is relatively little scope for period aging. We have to consider other mechanisms – such as

  • Using the Correlation ID structure to separate Batch from Dialog.
  • Using DDF Profiles to manage inbound work.
  • (Shudder) using WLM resource groups.

And, as I mentioned above,

  • Using SAP’s own mechanisms for managing work.

I’ve learnt a fair bit from this customer situation, building as it does on previous ones. Yes, I’m still learning at pace. One day I might even feel competent. 🙂

And it inspires me even more to consider releasing my DDF SMF 101 Analysis code. Stay tuned!