Recent Conference Presentations

(Originally posted 2013-06-24.)

I’ve been fortunate enough to speak at two conferences in the past month:

  • UKCMG Annual Conference – in London in May
  • System z Technical University – just outside Munich in June

I presented three different sets of material, with very little overlap:

The first two were given at both conferences and the third one only in Munich.

If you click on the links you’ll see I’ve uploaded them all to Slideshare. I considered it only fair to wait until after the Munich conference to upload them.


I thoroughly enjoyed both conferences. But then I usually do: It’s my prime method of learning (other than by doing, and by data analysis) and I’m always pleased to run into old friends and new. (As a firm believer in “strangers are just friends we haven’t met yet” there were a fair number of those, too.)

It was particularly good to see so many people from Hursley – both CICS and WebSphere MQ – as well as the usual other Development labs.

If I had to pick just one new feature I spotted it would be the support for Variable Blocked Spanned (VBS) records with REXX’s EXECIO in z/OS 2.1 TSO. Think “SMF processing with REXX”. In my residency this autumn I expect to have a chance to play with this.
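
Here’s a minimal, hedged sketch of the sort of thing that should become possible – assuming an SMF dump data set (RECFM=VBS) allocated to a DD I’ve called SMFIN; the DD name and the trivial length report are purely illustrative:

/* Read the whole (VBS) SMF dump data set into a stem */
"EXECIO * DISKR SMFIN (STEM smf. FINIS"
say "Read" smf.0 "SMF records"
maxlen=0
do i=1 to smf.0
  maxlen=max(maxlen,length(smf.i))
end
say "Longest record is" maxlen "bytes"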


I’m writing this using Byword on an iPad Mini, with a Logitech external keyboard – on a flight to South Africa. It’s proving to be a good combination, especially as Byword has MultiMarkdown support (a set of extensions to Markdown) and Evernote support. It’ll probably be my new writing rig for a while.

Of course it helps if you put the iPad Mini in the right way round. 🙂


And a week has passed since I wrote this. 😦 It’s been a busy week that’s really been an interesting test of the idea of “treating an address space as a black box”. (And so much more besides.) Perhaps I’ll write about it some time.

Dragging REXX Into The 21st Century?

(Originally posted 2013-06-07.)

I like REXX but sometimes it leaves a little to be desired. This post is about a technique for dealing with some of the issues. I present it in the hope some of you will find it worth building on, or using directly.

Note: I’m talking about Classic REXX and not Open Object REXX.

List Comprehensions are widespread in modern programming languages – because they concisely express otherwise verbose concepts, such as looping.

Here’s an example from JavaScript:

var numbers = [1, 4, 9];
var roots = numbers.map(Math.sqrt);
/* roots is now [1, 2, 3], numbers is still [1, 4, 9] */

It’s taken from here, which is a good description of JavaScript’s support for arrays.

Essentially it applies the square root function (Math.sqrt) to each element of the array numbers, using the map method. Even though it processes every element there’s no loop in sight. This, to me, is quite elegant and very maintainable. It gets rid of a lot of looping cruft that adds no value.

My Challenge

I have a lot of REXX code – essential to fetch data from the performance databases I build and turn it into graphs and tabular reports. Much of this code iterates over stem variables (similar to arrays – for the non-REXX reader) or character strings that are tokens separated by spaces (blanks).

An example of a blank-delimited token string is:

address_spaces="CICSIP01 CICSIP02 CICSPA CICXYZ DB1ADBM1 MQ1AMSTR MQ1ACHIN"

It would be really nice when processing such a string – perhaps to pick up all the tokens beginning “CICS” – to be able to do it simply. Perhaps an incantation like:

cics_regions=filter("find","CICS",address_spaces)

In this example the filter routine applies the find routine to each token in the string, with a parameter "CICS" (the search argument).

And not a loop in sight.
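
For illustration, here’s one possible shape for the find routine – it’s hypothetical, and consistent with the parameter order the filter implementation below assumes (search argument first, item last):

find: procedure
parse arg needle,item
/* Return 1 if the token starts with the search argument, 0 otherwise */
return pos(needle,item)=1

Applied to the address_spaces string above, cics_regions would come back as “CICSIP01 CICSIP02 CICSPA”.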

My Experiment

I implemented versions of map, filter and reduce. I’ll talk about how but first here’s what they do:

  • map – applies a routine to each element.
  • filter – creates a subset of the list, keeping or discarding each element based on the routine’s return value (1 to keep, 0 to throw the item away).
  • reduce – produces a result from an initial value by applying a routine to each element in turn.

Here’s a simple version of filter:

filter: procedure
parse arg f,p1,p2,p3,list
/* Work out how many optional parameters (p1 to p3) were passed and */
/* build the start of the call to the filtering function f          */
if list="" then do
  parse arg f,p1,p2,list
  if list="" then do
    parse arg f,p1,list
    if list="" then do
      funstem="keepit="f"("
    end
    else do
      funstem="keepit="f"("p1","
    end
  end
  else do
    funstem="keepit="f"("p1","p2","
  end
end
else do
  funstem="keepit="f"("p1","p2","p3","
end
outlist=""
do forever
  /* Peel the next item off the front of the list */
  parse value list with item list
  /* Invoke the filtering function against this item */
  interpret funstem""item")"
  /* Keep the item only if the filtering function returned 1 */
  if keepit=1 then do
    if outlist="" then do
      outlist=item
    end
    else do
      outlist=outlist item
    end
  end
  if list="" then leave
end
return outlist

Variable “list” is the input space-separated list. “outlist” is the output list that filter builds – in the same space-separated list format.

Much of this is in fact parameter handling: The p1, p2, p3 optional parameters need checking for. But the “heavy lifting” comes in three parts:

  • Breaking the string into tokens (or items, if you prefer).

  • Using interpret to invoke the filter function (named in variable f) against each token.

  • Checking the value of the keepit variable on return from the filter function:

    If it’s 1 then keep the item. If not then remove it from the list.

I also wrote a filter called “grepFilter” (amongst others). Recall the example above where I wanted to find the string “CICS” at the beginning of a token. That could’ve been done with a filter that checked for pos("CICS",item)=1. That’s obviously a very simple case. grepFilter, as the name suggests, uses grep against each token. It worked nicely (though I suggest it fails my long-standing “minimise the transitions between REXX and Unix through BPXWUNIX” test).

And then I got playing with examples, including “pipelining” – from, say, map to filter to reduce – such as:

say reduce("sum",0,filter("gt",8,map("timesit",2,"1 2 3 4 5 6")))
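
To make that concrete, here’s a hedged sketch of map and reduce – much simplified compared with my real versions, each taking a single fixed extra parameter – together with the hypothetical timesit, gt and sum routines the example assumes. With these in place the pipeline above says 22.

map: procedure
parse arg f,p1,list
/* Apply routine f, with fixed parameter p1, to every item in the list */
outlist=""
do while list<>""
  parse var list item list
  interpret "newitem="f"("p1","item")"
  outlist=strip(outlist newitem)
end
return outlist

reduce: procedure
parse arg f,acc,list
/* Fold routine f across the items, starting from the initial value acc */
do while list<>""
  parse var list item list
  interpret "acc="f"("acc","item")"
end
return acc

timesit: procedure
parse arg factor,item
return item*factor

gt: procedure
parse arg threshold,item
return item>threshold

sum: procedure
parse arg a,b
return a+b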

Issues

There are a number of issues with this approach:

  • You’ll notice the function name (first parameter in the filter example above) is in fact a character string.

    It’s not a function reference as other languages would see it. REXX doesn’t have a first class function data type. Suppose you didn’t have a procedure of that name in your code: You’d get some weird error messages at run time. And while you can pass around character strings all you want, the semantics are different from passing around function references.

  • The vital piece of REXX that makes this technique possible is the interpret instruction.

    It’s very powerful but comes at a bit of a cost: When the REXX interpreter starts it tokenises the REXX exec – for performance reasons. It can’t tokenise the string passed to interpret. So performance could suffer. For my use cases most of the time (and CPU time) is spent in commands (scripted by REXX) rather than in the REXX code itself. (I also think the process of mapping a function to a list suffers less than the average REXX instruction if run through interpret.)

  • The requirement to write, for example

    say reduce("sum",0,filter("gt",8,map("timesit",2,"1 2 3 4 5 6")))

    rather than

    say "1 2 3 4 5 6".map("timesit",2).filter("gt",8).reduce("sum",0)

    is inelegant. Fixing this would require subverting a major portion of what REXX is. And that’s not what I’m trying to do.

  • The need to apply a function to each item – particularly in the filter case – can be overkill.

    In my Production code I can write

    filter("item>8","1 2 4 8 16 32")

    as I check the first parameter for characters such as “>” and “=”. So no filtering function is required. (There’s a sketch of this after this list.)

  • REXX doesn’t have anonymous functions and I can’t think of a way to simulate them. Can you? If you look at the linked Wikipedia entry it shows how expressive they can be.
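
Here, for what it’s worth, is a hedged sketch of that expression-based twist – the name filter2 and the cut-down single-parameter form are purely illustrative, not my Production code. filter2("item>8","1 2 4 8 16 32") returns “16 32”.

filter2: procedure
parse arg f,list
outlist=""
do while list<>""
  parse var list item list
  /* If f looks like an expression over "item" interpret it directly, */
  /* otherwise treat it as the name of a filtering function as before */
  if verify(f,"<>=","M")>0 then interpret "keepit="f
  else interpret "keepit="f"("item")"
  if keepit then outlist=strip(outlist item)
end
return outlist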

These issues are worth thinking about but not – I would submit – show stoppers. They just require care in using these techniques and sensible expectations.

Conclusions

It’s perfectly possible to do some modern things in REXX – if you work at it. And this post has been the result of experimentation. Experimentation which I’m going to use directly in some of my programs. (In fact I’ve taken the prototype code and extended it for Production. I’ve kept it “simple” here.)

I’d note that “CMS Pipelines” would do some of this – but not all. And in any case most people don’t have CMS Pipelines – whether on VM or ported to TSO. (TSO is my case, but mostly in batch.)

I don’t believe “Classic” REXX to be under active development so asking for new features is probably a waste of time. Hence my tack of simulating them, and living with the limitations of the simulation: It still makes for clearer, more maintainable code.

Care to try to simulate other modern language features? Lambda or Currying would be pretty similar.

Of course if I had kept my blinkers on then I wouldn’t know about all these programming concepts and wouldn’t be trying to apply them to REXX. But where’s the fun in that?

Data Collection Requirements

(Originally posted 2013-06-01.)

Over the years I’ve written emails with data collection requirements dozens of times, with varying degrees of clarity. It would be better, wouldn’t it, to write it once. I don’t think I can get out of the business of writing such emails entirely but here’s a goodly chunk of it.

Another thing that struck me is that the value of some types of data has increased enormously over time. So some data that might’ve been in the “I don’t mind if you don’t send it” category has been elevated to “things will be much better if you do”.

I’d like to articulate that more clearly.

Bare Minimum

I always want SMF Types 70 through to 79, whatever you have.

I put it this loosely because I have almost 100% success with getting the data I really need, because the customer’s already collecting it. That has the distinct advantage of allowing me to ask for historical data, whether for longitudinal studies or just to capture a “problem” period.

There’s rarely been a study where I didn’t want RMF data. (Occasionally I’ve only wanted DB2 Accounting Trace.)

I like to have this data for a few days, but sometimes that isn’t possible. What really freaks my code out is having just a few intervals to work with.

Another question is “for which systems?” That’s a more difficult question to answer. Certainly I want all the major systems in the relevant sysplex. Ideally all the systems in the sysplex and even all the systems on the machines involved. But that’s usually not realistic. You’ve probably guessed already that the nearer to that ideal the better the insights (probably).

Strongly Enhancing

As you’ll’ve seen from this blog Type 30 Subtypes 2 and 3 Interval records are of increasing value to me: I recently ran some 15 month old data through the latest level of my code and was gratified at how much more insight I gained into how the customer’s systems worked.

A flavour of this is described in posts like Another Usage Of Usage Information.

So this data has definitely moved from the category of “I can just break down CPU usage a little further” to “it will make a large difference if you send it”.

Here the set of systems and time ranges I’d prefer to see data from can be much smaller: I probably don’t need to see this data from the Sysprogs’ sandpit.

Nice To Have

With most studies I can get by without the WLM Service Definition but it helps in certain circumstances (as I mentioned in Playing Spot The Difference With WLM Service Definitions.)

I’m OK with either the WLM ISPF TLIB or the XML version (as mentioned in that post).

If I want to take disk performance down to below the volume level SMF 42–6 Data Set Performance records are a must. It’s also the case you can learn an awful lot about DB2 table space and index space fragments from 42–6, there being a well-known naming convention for DB2 data sets.

Specialist Subjects

The above is common to most studies. The following deals with more common specialist needs.

DB2

Most of the time I’m seeking to explain one of two things about DB2:

  • Where the CPU is going.
  • Where the time is going.

In both cases I need DB2 Accounting Trace (SMF 101). The quality of this is variable. For example, to get CPU down to the Package (Program) level I need Trace Classes 7 and 8, in addition to the usual 1, 2 and 3. (Sometimes even 2 and 3 aren’t on.)

It’s quite likely this isn’t data that in its full glory is being collected all the time. So it’s a 30% chance of getting this retrospectively.

Sometimes I’m keen to understand the DB2 subsystem, which is where Statistics Trace comes in. The default statistics interval (STATIME) used to be a horrendous 30 minutes. Now it’s much lower so I’m pleased that issue has gone away. I ask for Trace Classes 1,3,4,5,6,8,10 which result in SMF 100 and 102 records. (I don’t ask for Performance Trace which also results in 102 records, albeit different ones.)

Again the questions of “for which subsystems?” and “when for?” come into play. That’s where negotiation is important:

  • It’s a lot of data to send.
  • Some installations deem it too expensive to collect on a continual basis.

I don’t disagree with either of those.

CICS

Here Statistics Trace (SMF 110) is really useful – especially if you have a sensible statistics interval.

For application response time and CPU breakdown to the transaction level Monitor Trace is the thing. Again this is the sort of thing customers don’t keep on a regular basis – or for that many regions. It’s also quite prone to breakage: Some customers remove fields from the record with a customised Monitor Control Table (MCT).

I try to glean what I can from SMF 30 about CICS – as numerous blog posts have pointed out – because I can get it for many more CICS regions than the CICS-specific data would furnish.

Batch

Batch is the area that takes in the widest range of data sources, the most fundamental of which are SMF 30 Subtypes 4 (Step-End) and 5 (Job-End).

I’ve already mentioned DB2 Accounting Trace and it’s most of what you need for understanding the timings of DB2 jobs.

For VSAM data sets SMF 62 is OPEN and 64 is CLOSE. For non-VSAM SMF 14 is Reads and 15 is Writes.

For DFSORT SMF 16 is really handy and even better if SMF=FULL is in effect. (Often it isn’t but I generally wouldn’t stop data collection to fix that.)

MQ

I only occasionally look at WebSphere MQ, though I’d like to do much more with it. I don’t think many people are familiar with the data. Statistics Trace (analogous to DB2’s but different) is SMF 115. Accounting Trace – which deals with applications – is SMF 116.

If I want to see what connects to MQ the Usage Information in SMF 30 is generally enough – though it doesn’t tell me much about work coming in through the CHIN address space. For that I really do need Accounting Trace. (An analogous point can be made about remote access to DB2.)

Getting Data To Me

This is usually OK but it’s worth reminding people of a few simple rules:

  1. Always use IFASMFDP (and / or IFASMFDL) to move SMF records around. (Using IEBGENER will probably break them.)
  2. TERSE the data using AMATERSE (or perhaps TRSMAIN). This works fine for both SMF data and ISPF TLIBs (and I suppose partitioned data sets in general).
  3. FTP the data BINARY to ECUREP.
  4. Make sure you send the data to the right directory in ECUREP’s file store. The standard encoding of the PMR number helps a number of IBM systems (such as RETAIN) to work swiftly and effectively. I’ll give you a PMR number on my queue (UZPMOS) or we can use another one.

That seems like a lot of rules but most of it should be familiar to anyone who’s ever sent in documentation in support of a PMR. (Only Rule 1 is new.) If you have access to the PMR text – and quite a few customers do – it should also enable you to track the data inbound.

In Conclusion

Realistically people might not have all the data I want and so there’s a process of negotiation, mainly trading timeliness and retrospective coverage against quality. Clarifying that trade-off would be helpful, which is why I like to run a Data Collection Kick-Off call. Ideally that call would be face to face, but I’m less insistent on that if distance makes it difficult.

At the end of the day I’m quite flexible and do what I can with whatever data I get. Of course you can’t magic missing data out of thin air, and can only occasionally repair it.

What I hope is that data collection is not overly burdensome and doesn’t cause stress to the customer. I also like to think that when they’ve sent the data in they can relax and more or less forget about it until “showtime”. 🙂

I also hope customers understand why they’ve been asked for the data they have. And that’s part of the point of this post, the rest being articulating what I need.

REXX That’s Sensitive To Where It’s Called From

(Originally posted 2013-05-26.)

I have REXX code that can be called directly by TSO (in DD SYSTSIN data) or else by another REXX function. I want it to behave differently in each case:

  • If called directly from TSO I want it to print something.
  • If called from another function I want it to return some values to the calling function.

So I thought about how to do this. The answer’s quite simple: Use the parse source command and examine the second word returned.

Here’s a simple example, the function myfunc.

/* REXX myfunc */
interpret "x=1;y=2"
parse source . envt .
if envt="COMMAND" then do
  /* Called from top level command */
  say x y
end
else do
  /* Called from procedure or function */
  return x y
end

The interpret command is a fancy way of assigning two variables (and really a leftover from another test). It works but you would normally code

x=1
y=2

The parse source command returns a number of words but it’s the second one that is of interest – and is saved in the variable envt.

The following lines test whether envt has the value “COMMAND” or not. If so the function’s been called directly from TSO. If not it hasn’t. In the one case the variable values are printed. In the other they’re returned to the calling routine.

The calling routine might have a line similar to

parse value myfunc() with x y

which would unpack the two variables.
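
Putting it together, a hypothetical caller exec might be as simple as this sketch:

/* REXX caller - because myfunc is invoked as a function here,     */
/* parse source inside it sees "FUNCTION" and the return path runs */
parse value myfunc() with x y
say "x is" x "and y is" y

Run myfunc directly from TSO instead – in SYSTSIN, say – and the second word from parse source is “COMMAND”, so it does the say itself.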

(The original interpret "x=1;y=2" might look silly but interpret "x='1 first';y='2 second'" doesn’t – as a way of passing strings with arbitrary spaces in them back from a routine.)

This is a simplified version of something I want to do in my own code. There might be a few people who will find this useful so why keep it to myself? 🙂 (It’s probably a standard technique nobody taught me.) 🙂

Discovering Report Class / Service Class Correspondences

(Originally posted 2013-05-22.)

It’s possible I’ve written something about this before: My blog is so extensive now it’s hard to find out exactly what I’ve written about (and I’m going to have to do something about that).

I say “written something” because I know for sure I haven’t written about the SMF record field I want to introduce you to now.

Previously

If when you send me data you include Type 30 interval records I’ll use them to relate WLM Service Classes to Report Classes: Workload, Service Class and Report Class are all in there.

But these records are only for address spaces. Address spaces that actually got created. And therein lies a problem: Only some of the Service Class / Report Class relationships can be gleaned this way.

In practice I’ve found this (incomplete but not inaccurate) information handy. So I’d like to fill in some gaps.

New News

I expect you didn’t know this either – so I call it “new news”: There’s a handy field in SMF 72 Subtype 3 (Workload Activity Report) called R723PLSC. It has nothing to do with PSLC.

This is defined as the “Service Class that last contributed to this Report Class period during this interval.” I’ve highlighted the word “last” as it’s quite important but we’ll come back to that in a minute.

This allows you to see some relationships for work that isn’t represented by address spaces, for instance DDF. (In my test data it’s DDF I’m seeing.)

I’ve spent some time adding this in to my code. Usually I’d summarise over several hours. In this case if I do I miss stuff.

The emphasised “last” above means that only one of the (potentially several) Service Classes that correspond to this Report Class shows up in the record. So I use a set of rows, each representing a short interval, to get the correspondence. In my test data this approach yields more correspondences – as the last one isn’t always the same one from interval to interval.
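
As a hedged illustration of the dedup, suppose each Report Class / Service Class pair from each short interval has been extracted – however you do that – into a stem called row., one “reportclass serviceclass” pair per entry (the stem name is just for illustration):

seen.=0
pairs=""
do r=1 to row.0
  parse var row.r rc sc .
  /* Record each distinct Report Class / Service Class pairing once */
  if seen.rc.sc=0 then do
    seen.rc.sc=1
    pairs=pairs rc"/"sc
  end
end
say "Correspondences:" strip(pairs)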

If you use Report Classes to break out a subset of a Service Class the “last Service Class” issue doesn’t arise. If you use Report Classes for aggregation (or in a hybrid way) it certainly does.

(I’m not all that keen on using Report Classes for aggregation anyway: Decent reporting tools can do that for you. But I could be persuaded. I’m keener on using them for breakouts, such as DDF applications that share a common Service Class, or to break out an address space or several.)

I’m not claiming to have got all the Service Class / Report Class correspondences but I’ve got more of them – and for an important set of cases: Service Classes and Report Classes that don’t correspond to address spaces.

As you’ll see in Playing Spot The Difference With WLM Service Definitions I prefer to have the WLM Service Definition to work with – and I’ll be asking for it more fervently in the future. But you have to work with the data you can readily obtain. And R723PLSC is a handy field to have learnt about. You might find it useful, too.

Playing Spot The Difference With WLM Service Definitions

(Originally posted 2013-05-20.)

A customer asked me to examine two WLM service definition snapshots taken on adjacent days – and discern any differences. This is not a challenge I’ve been set before – and so I expect it’s pretty rare. But, thinking about it, I reckon it could be quite useful.

So they sent me two service definitions, one day apart, and one from months later. When I say “a service definition” I mean the ISPF table library (TLIB) in which it’s stored. TERSEd and sent via FTP BINARY they make the trip just fine.

On my z/OS system I can fire up the WLM Service Definition Editor and read (and even edit) these just fine. (I could print the policy in a number of different formats – report, CSV file or GML. But I don’t choose to.)

I could even compare the two TLIBs but I don’t think that’d be a useful or consumable comparison. Likewise the three kinds of policy prints I just mentioned.

So, I decided to revisit an approach I’ve posted about before: Processing the XML version of the service definition.

Done right this could be a way to make a meaningful comparison – because it’s cognisant of the structure of the Service Definitions.

When I last looked at the XML version I think I was on Windows. Now I’m on Linux. Unperturbed I downloaded the WLM Service Definition Editor – which is a .exe file. But it unpacks – with Archive Manager – anyway. Inside is, amongst other things, a jar file and two REXX execs – ISPF2XML and XML2ISPF.

(You can run the jar file under Linux and the SD Editor works. But that’s not what I was after.)

The ISPF2XML exec runs under REXX/ISPF and produces a very nice XML file.

In conversation with the author it transpires this isn’t actively maintained anymore and one should probably use the one from z/OSMF now. (I’d be interested in seeing some XML from the new editor – if anyone is using it and is willing to send me a file.)

Again, I could compare two XML files but I don’t think a raw textual comparison would be very helpful.

A Byproduct – An XML Formatter For HTML

Because the XML file is well-formed it’s quite easy to parse it. And it is quite comprehensible, in its own terms.

So, I wrote some PHP code that uses XPATH to process the file. Why PHP? Because I’m creating HTML (and serving it from Apache on my laptop to itself) and its XPATH support is very good.

So now I have a nice Formatter for the whole Service Definition: If you get to send me your XML I get to send you the HTML. And at the same time I get to understand more about how the classification side of your WLM policy works.

(RMF has nothing to say about WLM classification rules’ firing: I use SMF 30 to try to guess this stuff – but that has limitations. And for DDF work I have to rely on the DB2 Accounting Trace QWACWLME field.)

Comparing Two Service Definitions

I only got part way through writing the comparison code: I compared the first-level nodes (children of the root node, such as the ClassificationRules element) by using the saveHTML method and lexically comparing the strings produced.

Correctly this told me the overnight changes were twofold:

  • A new report class had been added.
  • A new classification rule had been added, assigning this report class to some DDF work.

I say “correctly” because eyeballing my HTML reports told me the same thing: Opening the two of them in tabs, scrolling both to the top and then paging each down in turn took about 5 minutes. Doing it that slowly convinced me I’d spotted all the differences. (It helps I have built in some navigation aids along the way – which I won’t bore you with.)

Timestamps and Userids

Remember the two service definitions I’m comparing are from adjacent days. It’s handy they have timestamps for when, for example, a resource group was created. And there’s always a matching “update” timestamp.

Furthermore each timestamp is accompanied by a userid.

Sometimes these are goofy – “1900–01–01” 🙂 or “N.N” – but it’s nice to see “CLW” 🙂 appear, indicating the provenance of the service definition. (Even the goofy ones are, of course, meaningful.)

More seriously a timestamp between one day and the next is helpful.

Conclusion

RMF only gives you so much (but it’s a lot): Sometimes you need to go much further. And the XML version, whatever you do with it, fits the bill nicely.

Comparing two Service Definitions helps identify when changes were made. Picking up the timestamps narrows the doubt even more. And knowing the userids that authored changes helps drive discussions about who changed what and why.

You could readily save copies of the ISPF TLIB (or the XML) on a daily or weekly basis, and compare generations.

What I’d like to know is if a WLM Service Definition comparison tool would be generally useful for customers. Well would it?

New Batch Residency

(Originally posted 2013-05-02.)

In October Frank Kyne and I expect to run a residency in Poughkeepsie. You can find the announcement here.

The residency builds on the ideas presented here and three subsequent posts.

I revisited a specific part of it in Cloning Fan-In.

So what are we going to do?

For a start we’re going to assemble a team of 4 skilled mainframe folks from wherever we can. 🙂 One of them will be me, which leaves 3. You could be one of those – but only if you throw your hat in the ring.

We’re looking for three distinct roles:

  • Someone with good scheduling (TWS) and JCL skills.
  • Someone with experience in writing and tuning COBOL / DB2 programs.
  • The same but for the pairing of PL/I and VSAM.

Actually there’s some flexibility in these last two roles: COBOL / VSAM and PL/I / DB2 would work just fine.

But I still haven’t told you what we’ll actually do…

Residency Goal

We aim to teach people how to successfully clone individual batch jobs – through examples and guidance.

How We’ll Do It

The referenced blog posts describe some theory. This residency will write a Redbook that’ll describe the practice.

We’ll create two test cases that we’ll assert want cloning. They’ll process a large number of records / rows – in a loop. This is a very common application pattern: If you can think of another one we’ll entertain it.

One program will be written in COBOL and the other in PL/I. Hence the programming skill requirements.

One will access DB2 data primarily, the other mainly VSAM. Which explains those two skill requirements.

So the first few days will create these baselines – and measure them.

Then we’ll investigate cloning – 2-up, then 4-up, then 8-up, etc..

Why This Is Non-Trivial

If these cases were read-only this might be trivial. If these cases didn’t write a summary report at the end the same might be true.

But we won’t make it so easy on ourselves:

  • We’ll update something in each case – a file or a DB2 table.
  • We’ll read a second file / table (a lookup table, if you will).
  • We’ll write a report at the end.

All these reflect real life problems people will have.

And if the residents can think of some more pain to inflict on ourselves we will. 🙂

The Report

As I mentioned in Cloning Fan-In many programs produce a report. This is easy before cloning. With cloning it’s much harder. So we need to exercise that.

But I posited a modified architecture: Create data files and merge them in a separate reporting job. JSON could be involved, and so could XML – as those files should be modern. I say that because one benefit of cloning a job could be making the reporting data available to other consumers.

If we have time someone could explore this.

Why We Need A Scheduling Person

First, real life would require you to integrate a cloned job into Production: Scheduling one job, complete with recovery, is one thing. Scheduling cloned jobs is another.

Second, it’s not enough to succeed once in cloning a job: Installations will want to automate splitting again and maybe even dynamically decide how many clones. (And maybe where they’ll run.)

The TWS person will not only do scheduling but figure out how best to structure the JCL. Though not the main thrust of this residency, z/OS 2.1 will have JCL enhancements that I expect to be useful here. We’ll have 2.1 on our LPAR so you can play with this.

If you’ve never played with BatchPipes/MVS I expect you’ll get to try it out, too.

Measurements

While we overtly state this is not a formal benchmark, we’ll take lots of measurements and tune accordingly.

This I’m expecting to play the main role in.

Write Up

The idea of this is to deliver practical guidance through real life case studies. So there’ll be a book and maybe a presentation.

We’ll document what we did, what issues arose, how we resolved them, and what we learnt. And this will draw on all our perspectives.

As the application programs aren’t the main deliverable they’ll probably go in appendices. Tweaks we have to make to the code, JCL and schedule will be highlighted. Reporting requirements will also be described.

Finally

I think this will be a lot of fun. I also think the contact with Development will be fruitful.

So I invite you all to consider applying. Nominations close 5 July.

Analysing A WLM Policy – Part 2

(Originally posted 2013-05-01.)

This is the second part, following on from Part 1.

Importance Versus Velocity

After drawing out the hierarchy you have to set actual goals – whether velocity or some form of response time. And you have to set importances.

The importances should now be easy as they flow from the hierarchy. IRLM should be in SYSSTC – which serves as an anchor at the top. It’s not quite as simple, though, as assigning from 1 downwards – perhaps with gaps. You might find there are too many hierarchical steps and you have to decide how to conflate some.

It’s important to understand that Importance trumps Velocity: Importance 1 goals are satisfied first, then 2 and so on.

But a low Velocity service class period with Importance 1 might well have its goal satisfied too easily and WLM will then go on to satisfy less important service class periods rather than trying to overachieve the Imp 1 goal.

Further, a velocity goal that is always overachieved provides no protection on those occasions when resources become constrained: The attainment could well be dragged all the way down to the goal.

At the other extreme an overly aggressive velocity goal can lead to WLM giving up on the goal.

It sounds against the spirit of WLM but I’d set a goal at roughly normal attainment – assuming this level provides acceptable performance.

Actually the same things apply to response time goals: Importance overrides and setting the tightness of the goal right is important.

What’s In A Name?

I’ve seen enough WLM policies now to know they fall into three categories:

I know these because of the names therein. (By the way it’s the third category I see the most problems in.)

The first two contain names which are rhetorically useful such as “STCHI”. It’s better neither to name them something too specific – in case you have to repurpose them – nor to encode the goal values in the name – in case you have to adjust them (as you probably will).

By the way the same applies to the descriptions – which appear in SMF 72–3. If I ever learn Serbo-Croat it’ll probably be from SMF. 🙂

The Importance Of Instrumentation

Having just mentioned SMF let me talk about instrumentation.

Recall I was asked to look at a WLM policy.

Initially I was sent a WLM policy print. I then asked for (and swiftly got) appropriate SMF.

The point is it’s both you need:

  • The policy (in whatever form) gives you the rhetoric.
  • The SMF gives you the reality of how it performs.

Note the SMF doesn’t give you classification rules but the policy obviously does.

As an aside I’ve posited to Development it would be useful to instrument which classification rules fire with what frequency. Do you agree?

The most obvious use case is figuring out which rules are actually worthwhile, not that that’s a major theme in WLM tuning. I suspect there are others.

I’d like to thank Dougie Lawson and Colin Paice for their help in thinking about certain subsystems they are more conversant with than I am. This whole discussion would’ve been a lot worse without their input.

Analysing A WLM Policy – Part 1

(Originally posted 2013-05-01.)

This post started out with the title “Insufficient Nosiness?” I think most of mine do. 🙂 And if they do they should be subtitled “What You Don’t Know Can Still Harm You”. 🙂

Since then its scope’s expanded somewhat and now it’s in two parts, the second part being here.

A lot of things have come together recently…

I’ve just been involved in a discussion with a customer – which stretched me but I believe that was a good thing. Almost all I can tell you about the situation is that it involved some design work around their WLM policy, and that their installation has lots of lovely complexity.

I’ve had “warm up gigs” 🙂 recently in that I’ve been involved in several discussions about how to classify subsystems – for example CICS, DB2 and DDF. But none of these has been as comprehensive as this one. Hence the “stretching”.

If I’m looking for “lessons learned” (and I think I always am) they’d be a heady mix of things I didn’t know, things I did know that got brought into sharp relief, and new ways of structuring my thinking.

I’ll admit to walking in with a little uncertainty that I could think my way through a WLM policy review but I gave it some thought and I emerged from the discussions much happier about it.

It struck me the first thing to do is to discover what address space serves what. (Generally speaking it is an address space that serves, but it often isn’t an address space that gets served – DDF transactions being a good example.)

The motivation for this – and I think it’s well known – is that work should not be allowed to starve the address spaces that serve it of CPU. The reason for labouring the point about the serving hierarchy is that this structure gets quite complex to follow. Previous customer discussions hadn’t thrown this “real world” complexity into sharp relief: They’d only exposed parts of the hierarchy (such as the previously mentioned DB2 portion). Typically people talk about simplish things like within-product relationships.

Here’s a typical CICS one: I’m advocating a more comprehensive approach. (I was going to write a presentation about just that: within product considerations. I now think it’ll be a different presentation if it emerges at all.)

We didn’t actually draw the hierarchy on a piece of paper: I think next time I actually will create a physical drawing.

Discerning The Hierarchy

This directly draws on previously-mentioned information sources, such as SMF 30 Usage data, or DB2 Accounting Trace. The discussions this week laid out the hierarchy by people talking. Perish the thought. 🙂 Actually the Usage information did get a look in, in a supporting role.

I think you can start wherever you like. Perhaps because it’s quite complex you could start with DB2 (if relevant):

By the way the horizontal lines are boundaries between categories. You might find value in using “must be above” arrows between components instead.

The following is the previous two combined.

On reflection this is getting to the limit before arrows are required.

Also notice the TOR is viewed as the anchor point for the CICS application. You could argue the TOR need not be below DBM1. But I’d try and separate them if at all possible.


The second part of this two-part post is here.

How Many Eggs In Which Baskets?

(Originally posted 2013-04-08.)

You wouldn’t put all your eggs in one basket, CICSwise, would you? A naive reading of the CICS TS 5.1 announcement materials might lead you to suppose you could. This post is about thinking about your CICS region portfolio in the light of this announcement.

While every CICS release introduces capabilities that make it worthwhile to review your region portfolio, 5.1 majors on scalability. So, in the months (hopefully only months) before you install 5.1 and eventually go live, it would be a good idea to review your CICS region portfolio.

(I should properly say “application” rather than “region” – but we Performance Folks are more likely to get involved in discussions about regions than applications. We should still take a more-than-polite interest in applications. However, this post is indeed rather more about regions than applications.)

So let’s review why installations split applications up into multiple regions. There are essentially three reasons:

  • Architectural
  • Availability
  • Performance and scalability

When reviewing your portfolio it’s worth looking at all these categories.

And to me one of the major benefits of 5.1 is that it gives you more choices.

Architecture

You’re probably thinking I protest too much about not being an architect. I’ve talked about it enough times. 🙂

What I would say is it’s worth understanding the role of each CICS region.

  • You can begin by using the SMF 30 Usage information – as I discuss in Another Usage Of Usage Information. In that post I point out you can get topology information – such as which MQ or DB2 subsystem a region uses – just from SMF 30.
  • The above trick won’t detect File-Owning Regions (FOR’s). You could probably spot one of those from the Disk EXCP counts in SMF 30 or, failing that, in SMF 42–6.
  • You could have some fun with region names – as I discuss in He Picks On CICS.
  • You could use CICS’ own Performance Trace – and I think CICS Performance Analyzer helps with this – to figure out how transactions flow.
  • Or you could actually talk to CICS people. 🙂 Actually that’s not an exclusive or.

From the above you can get to knowing which regions are part of which application, can tell FOR’s from AOR’s from DOR’s from QOR’s from TOR’s, and generally have a crack at figuring out how set up for availability it all is. All before breakfast. 🙂

Hmm. I think I’m going to have to write me some more code… 🙂

And, of course, in 5.1 the architectural choices increase again.

Availability

Personally I recommend having at least four servers for resilience, though that is sometimes unaffordable.

The reason I recommend four rather than two is quite straightforward: If running out of a resource causes a server to fail, only having two means the other one is likely to fail as well. Having three others makes it much more likely the survivors could handle the load. Virtual Storage is a good example of this.

Of course there’s a cost to provisioning four rather than two – day in, day out. Consider four-way Data Sharing: Thankfully the cost difference between non- and two-way Data Sharing is usually greater than that between two-way and four-way.

Each installation must make its own decisions on availability versus cost.

Performance and Scalability

There have traditionally been two reasons for limiting the size of a CICS region, performancewise:

  • QR TCB Constraint
  • Virtual Storage

QR TCB Constraint

I wrote about this in New CPU Information In SMF Type 30 Records, where I posited the new CPU metrics introduced into SMF Type 30 in APAR OA39629 could help establish if the QR TCB is large.

In early client data I consistently see the biggest TCB in CICS regions as being “DFHKETCB” so I think this is the QR TCB. I decode this string as “DFH for CICS”, followed by “KE for Kernel” and “TCB is TCB”, so this all makes sense to me.

In any case you could work with the SMF 30 TCB time: If it’s a significant portion of an engine you might look at the biggest TCB. Whether or not that is the QR TCB, a large percentage of an engine for the biggest TCB would warrant examination. If it is the QR TCB then you have work to do before such a region could be combined with others.

For example, a CICS region with 90% of an engine at peak would warrant further investigation: If the biggest TCB were DFHKETCB and only 20% of an engine you could combine maybe 3 such regions without concern for QR TCB constraint.

If, however, the QR TCB were larger you’d want to consider the appropriateness of Threadsafe before concluding regions couldn’t be merged.

In 5.1 more commands have been made Threadsafe, as has the Transient Data (TD) Facility. This follows all the extensions to Threadsafe applicability over prior releases. (See Threadsafe Considerations for CICS.)

Virtual Storage

Historically CICS has used 24-, 31- and 64-bit virtual storage: Both 24- and 31-bit virtual storage should be viewed as scarce resources, especially 24-bit.

As a coarse upper bound you can use the SMF 30 Allocated virtual storage numbers.

For example, a region with less than 2MB of 24-bit allocated is probably not threatening when combined with a few others. Similarly a region with less than 500MB of 31-bit allocated is probably not an issue if combined with one or two more.

I emphasise coarse because CICS suballocates memory and has its own sophisticated memory management regime. You should use the CICS Statistics Trace virtual storage numbers to treat this subject properly.

In 5.1 a substantial number of areas have been moved to 31-bit virtual storage from 24-bit. Similarly, a substantial number of areas have moved from 31-bit to 64-bit.

Benefits Of Merging Regions

It’s worth pointing out that there are advantages in reducing the number of CICS regions. Two in particular come to mind:

  • Reduced operational complexity
  • Potentially improved resource usage and performance.

Others can much better explain the operational benefits. As a primarily performance guy I consider questions of resource consumption and effectiveness. Two simple examples are:

  • CICS doesn’t load a program each time a transaction that uses it runs: It keeps it in virtual storage. Two regions potentially means two copies – which would require twice the real memory. One region obviously doesn’t.

  • In the case of VSAM (LSR) buffer pools two regions require two pools for every one that a single region would have. Again, to get the same buffer pool effectiveness is highly likely to require twice the amount of real memory to back the pools as in the single region case.

Conclusion

In the examples in this post I gave some numbers. Please don’t use them as rules of thumb – without applying further thought. They are just reasonable examples: Derive your own.

Further, this whole discussion has been necessarily simplistic. But I think asking some basic questions is a very good start. Hopefully I’ve given you a way to look at whether CICS TS 5.1 (and indeed 4.2 or any other release, but less so) provides an opportunity to rework your portfolio of CICS regions and applications.

To recap, if anything, 5.1 gives you choices. (Actually it gives you lots of other things but the focus of this post has been narrow: How many eggs in how few baskets?)

Talking of those other things 5.1 brings, CICS Transaction Server for z/OS Version 5 Release 1 What’s New is well worth a read.

I’m wondering whether it would be useful to work this post up into a presentation on the topic – probably with considerable help from people who major on CICS. What do you think?

Also, I considered inserting some graphics but thought the ones I came up with to be gratuitous and unhelpful. So I didn’t. So there. 🙂