The Effect Of CF Structure Distance

(Originally posted 2014-11-29.)

Here’s an interesting case that illustrates the effect of distance between a z/OS image and a Coupling Facility structure.

I don’t think this will embarrass the customer; It’s not untypical of what I see. If anything I’m the one that should be slightly embarrassed, as you’ll see…

A customer has two machines, 3 kilometers apart, with an ICF in each machine and Parallel Sysplex members in each machine. There is one major (head and shoulders above the rest) structure: IRRXCF_P001 (with a backup IRRXCF_B001 in the other ICF).

The vast majority of the traffic to this structure is from a system on the “remote” machine (the one 3km distant from the ICF).

At this point I’ll admit I’d not paid much attention to IRRXCF_Pnnn and IRRXCF_Bnnn structures in the past – largely because traffic to them is typically lower than other structures such as DB2 LOCK1 and ISGLOCK (GRS Star). I hadn’t even twigged that this cache structure was accessed via requests that were initially Synchronous. (And that’s the tiny bit of embarrassment I’ll admit to.)

RACF IRRXCF_* Structures

Let me share some of what the z/OS Security Server RACF System Programmer’s Guide says:

To use RACF data sharing you need one cache structure for each data set specified in your RACF data set name table (ICHRDSNT). For example, if you have one primary data set and one backup data set, you need to define two cache structures.

The format of RACF cache structure names is IRRXCF00_tnnn where:

  • t is “P” for primary or “B” for backup
  • nnn is the relative position of the data set in the data set name table (a decimal number, 001-090)

So the gist of this is that it’s a cache for RACF requests and it’s accessed Synchronously – at least in theory.

So what does “accessed Synchronously” actually mean?

RACF issues requests to XES, the component of z/OS that actually communicates with Coupling Facility structures. XES can decide, heuristically, to convert Synchronous requests to Asynchronous.[1] So in this case RACF requested Synchronous; XES might have converted to Async, depending on sampled service times.
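
If it helps to picture what “heuristically” means, here’s a deliberately simplified Python sketch. The real XES algorithm, its sampling and its thresholds are internal to z/OS and not documented here, so the threshold value, window size and function names below are purely illustrative assumptions.

# Illustrative only: NOT the real XES algorithm, sampling scheme or thresholds.
# The idea: keep a moving view of observed synchronous service times and
# convert requests to asynchronous once they look too long to justify
# spinning the CP while waiting for completion.
from collections import deque

SYNC_THRESHOLD_MICROSECONDS = 26.0   # made-up figure, purely for illustration
SAMPLE_WINDOW = 64                   # made-up sampling window

recent_sync_times = deque(maxlen=SAMPLE_WINDOW)

def record_sync_service_time(microseconds):
    """Remember a sampled synchronous service time."""
    recent_sync_times.append(microseconds)

def should_convert_to_async():
    """Convert if the recent average sync service time exceeds the threshold."""
    if not recent_sync_times:
        return False
    average = sum(recent_sync_times) / len(recent_sync_times)
    return average > SYNC_THRESHOLD_MICROSECONDS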

A Case In Point

Now to the example:

Three systems (SYSA, SYSC and SYSD) are on the footprint 3km away from the structure. SYSB is on the same footprint as the structure.

Let’s first examine the traffic rate:

As you can see, the vast majority of the traffic is from SYSC and almost all the traffic is Async. (RMF can’t tell whether RACF issued the requests Sync and they were converted to Async, or whether RACF issued them Async in the first place.) The fact that all three remote systems are more red than blue, while the local one (SYSB) is all blue, suggests RACF issues them Sync (as if we didn’t know).

Now let’s look at the response times:

Here the local system (SYSB) using IC Peer (ICP) links shows a very nice response time under 5µs.

The remote systems (using InfiniBand (CIB) links) show response times between 55 and 270µs.

A reasonable question to ask is “how come SYSC has so much better response times than SYSA and SYSD, given they’re on the same footprint?”

You might think it’s because it has some kind of Practice Effect; You drive a higher request rate and the service time improves (which XCF’s Coupling Facility structures often appear to exhibit).

But here’s a graph – just for SYSC – which disproves this:

I’ve sequenced 30-minute data points not by time of day but by request rate. Here are the highlights:

  • The response time stays level at 50-60µs regardless of the request rate. So it’s not a Practice Effect.

  • The percent of requests that are Sync stays very low – so conversion is almost always happening. (At least it’s consistent.)

  • The Coupling Facility CPU per request (not Coupled (z/OS) CPU) is around 5% of the response time; The rest is the effect of going Async as well as the link time.

Now SYSA has a much lower LPAR weight than SYSD, which in turn has a lower LPAR weight than SYSC. The response times are the opposite: SYSC lowest, then SYSD, then SYSA.

So we have (negative) correlation. But what about causation?

Well, I’ve seen this before:

When a request is Async (whether converted or not) the request completes by the caller being tapped on the shoulder. In a PR/SM environment this can’t happen until the coupled (z/OS) LPAR’s logical engine is dispatched on a physical engine. If the weight for the LPAR (or, in the HiperDispatch case, the vertical weight for the engine) is low it might take a while for the logical engine to be dispatched on a physical one.

The consequence is that lower-weighted LPARs get worse response times for Async because of the time taken to deliver the “completion” signal.
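
To make that concrete, here’s a tiny sketch with invented numbers. The point isn’t the figures – they’re made up – but the shape of the effect: the observed Async response time is the Coupling Facility service time plus the time to deliver the completion, and the delivery time grows as the logical engine waits longer to be dispatched.

# Invented numbers, purely to illustrate the shape of the effect.
cf_service_time_us = 15.0   # time the Coupling Facility itself spends on the request

# Hypothetical average delays in delivering the "completion" signal,
# loosely ordered by LPAR weight (higher weight = dispatched sooner).
dispatch_delay_us = {
    "SYSC": 40.0,    # highest weight
    "SYSD": 120.0,
    "SYSA": 250.0,   # lowest weight
}

for lpar, delay in dispatch_delay_us.items():
    observed = cf_service_time_us + delay
    print(f"{lpar}: observed Async response time roughly {observed:.0f} microseconds")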

A Happy Ending

My advice to the customer was to move the IRRXCF_P001 structure to the Coupling Facility on the same footprint as the busiest LPAR (SYSC).

They did this and the response time dropped to 4µs, with the vast majority of the requests now being Sync.

This is an unusual case in that normally one LPAR doesn’t dominate the traffic to a structure. So the choice of where to put the structure was unusually easy.

I would add two things related to the applications on SYSC:

  • They are IMS applications and I don’t know enough about IMS security to know whether it’s possible to tune down the SAF requests – and so reduce the Coupling Facility request rate – without compromising security.

  • There is no direct correlation between IRRXCF_P001 service time and IMS transaction response times. Such is often the way.

But this has been an interesting and instructive case to work through. And you could consider this blog post penance for mistakenly thinking RACF Coupling Facility requests were always Async. 🙂

In a similar vein you might like:


  1. But XES can’t convert Async requests to Sync.  ↩

Coupling Facility Memory

(Originally posted 2014-11-23.)

Or “who made all the pies”? 🙂

I’ve written a number of times about Coupling Facility Performance but I don’t think I’ve written about memory for a while.

In any case I’d like to share with you a couple of graphs I’ve taught my code to make. The first isn’t strictly speaking specific to Coupling Facilities. But it’s useful anyway and does help tell the story.

Machine-Level Memory Allocation

This graph is, as I said, applicable to any machine – whether it has Coupling Facility LPARs or not.

LPAR Memory allocation is more or less static, so a pie chart is appropriate. In fact my code averages over a shift. [1]

What z/OS and RMF don’t know is the amount of memory (purchased) on a machine – so the Unallocated memory can’t be depicted without the analyst (me) manually telling the code.[2] Which is why I’m not showing it here.

In this (confected) example memory usage is dominated by two Linux [3] (probably under z/VM) LPARs and MVSA. There are also two Coupling Facility [4] LPARs – PRODCF1 and PRODCF3.

This data is from SMF 70 (the data behind the RMF Partition Data Report), taking care to avoid double-counting when, for example, zIIPs are present. In my real code I also group together the small LPARs.
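
If you fancy drawing something similar yourself, here’s a minimal matplotlib sketch. The LPAR names and gigabyte figures are invented; in practice the values would come from your own SMF 70 reduction, averaged over a shift.

# Minimal sketch of a machine-level memory allocation pie chart.
# The LPAR names and GB values are made up; in real use they would come
# from an SMF 70 (Partition Data) reduction, averaged over a shift.
import matplotlib.pyplot as plt

lpar_memory_gb = {
    "LINUXA": 512,
    "LINUXB": 384,
    "MVSA": 256,
    "PRODCF1": 64,
    "PRODCF3": 64,
    "Small LPARs": 48,   # roll the small ones together, as the real code does
}

labels = [f"{name} ({gb} GB)" for name, gb in lpar_memory_gb.items()]
plt.pie(list(lpar_memory_gb.values()), labels=labels, startangle=90)
plt.title("Machine-Level Memory Allocation (illustrative data)")
plt.axis("equal")   # keep the pie circular
plt.show()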

Coupling Facility Memory

So let’s continue with our fictional example by examining one of these two Coupling Facility LPARs and drilling down:

This is from Coupling Facility Activity (SMF 74–4) data.

In this case I do know the Unallocated memory and so show it on the graph. It’s actually useful because it can help drive some conversations like:

  • What’s your “white space strategy?”
  • Perhaps you could increase the size of, say, GBP 10.

In this example I’ve marked GBP 10 with a *. In my code this denotes the structure is already at its maximum size: When defining a structure you specify its Maximum, Minimum and Initial sizes. Whether by manual intervention or using the AUTOALTER mechanism[5] the structure might grow to the Maximum. RMF documents these sizes so it’s easy to test whether a structure has grown to its limit.
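
Here’s a hedged sketch of that test. The dictionary keys are placeholders of my own, not SMF 74-4 field names; the point is simply comparing the allocated size against the defined maximum and flagging structures that have hit it.

# Hypothetical structure records; the keys are placeholders, not SMF 74-4 field names.
structures = [
    {"name": "GBP10",       "allocated_kb": 512000, "maximum_kb": 512000},
    {"name": "IRRXCF_P001", "allocated_kb": 64000,  "maximum_kb": 128000},
]

for s in structures:
    flag = " *" if s["allocated_kb"] >= s["maximum_kb"] else ""
    print(f'{s["name"]}{flag}: {s["allocated_kb"]} KB allocated of {s["maximum_kb"]} KB maximum')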

Just because a structure has grown to the limit doesn’t mean you have to increase the size: RMF also documents whether the structure is full and whether this fullness is leading to such things as Directory Entry Reclaims or Data Element Reclaims. The extent to which this matters depends on the structure exploiter.

For example, the two XCF structures have their own instrumentation (SMF 74–2) which could be handy. In this case one could guess that one structure is for standard size (956 byte) messages and the other for large messages – but 74–2 data would confirm this and document the traffic.

“White space” is a complex subject and involves understanding how structures are allocated in Coupling Facilities and how they are to be recovered. I won’t try and do it justice in this post. But it’s a major reason why memory is left unallocated in Coupling Facility LPARs. Basically you should ask questions like “what happens if this structure fails?” and “how would I recover structures in the event of a failing Coupling Facility LPAR?” White space can be part of the answer.


Making Pies

I generally don’t like pie charts. But for the two cases outlined above they’re OK. The data is relatively static, though occasionally the values and names do change.

In this particular case the pie charts fill a niche in “storytelling”: I needed something to talk about where a machine’s memory is going. And I needed something to talk about the memory inside a Coupling Facility image.

What I wouldn’t like is to flash up these two charts and not dig deeper. Well, the data allows for much deeper digging; And I’m not inclined to stay shallow… 🙂

Other Posts on Coupling Facility You Might Like

I’m reminded by a friend’s question a couple of days ago that I’ve written a bunch of posts on Coupling Facility. Here are a few:


  1. A bunch of contiguous hours.  ↩

  2. In some cases I do tell my code how much memory the customer has. I get it from Vital Product Data, which z/OS and RMF don’t have access to.  ↩

  3. I’m increasingly seeing very large Linux LPARs; I’m not sure if it’s lots of small Linux virtual machines under VM or a few large ones.  ↩

  4. And we can tell they are Coupling Facility LPARs by means other than just guessing from their names.  ↩

  5. For those structures that support it.  ↩

After An Indecent Interval

(Originally posted 2014-11-16.)

In After A Decent Interval I talked about the need for frequently-cut SMF Interval records. This post is about bad behaviours (or maybe not so bad, depending on your point of view).

It’s actually an exploration of when interval-related records get cut, which turned into a bit of a “Think Friday” experiment. But I think – quite apart from the interest – it has some usefulness in my “day job”.

I share it with you in the hope you’ll find it interesting and perhaps useful.

And it’s in some ways related to what I wrote about in The End Is Nigh For CICS.

Code To Analyse SMF Timestamps

Let’s start with a simple DFSORT job to analyse the minutes – within the hour – when SMF 30 records are cut. It’s restricted to looking at SMF 30 Subtype 2 Interval records. These are the 30s one would most expect to be cut on a regular interval. (Subtype 3 records are cut when a step ends – to complement the Subtype 2s.)

In reality the analysis I’m sharing with you uses more complex forms of this basic job. But we’ll get on to the analysis in a minute.

The first step is very simple and deletes a report data set. The second step writes to this data set and, identically, to the SPOOL. [1]

The INCLUDE statement throws away all record types other than SMF 30 Subtype 2.

The INREC statement turns the surviving records into two fields:

  • The record’s time in the form hhmmss, for example ‘084500’ for 8:45AM.
  • A 4-byte field with the value ‘1’ in.

These two fields are used in the SORT statement to sort by the minute portion of the timestamp and the SUM statement to count the number of records with a given minute.

The purpose of the OUTFIL statement is to format the report for two destinations. It produces a two-column report. The first is the minute and the second the number of records whose timestamp is within that minute. For example:

MIN             Records
---             -----------
 00                   49844
 01                      23
 02                      45
 ..                      ..

Here’s the code.

//DELOUT   EXEC PGM=IDCAMS 
//SYSPRINT  DD  SYSOUT=K,HOLD=YES 
//SYSIN     DD  * 
 DELETE <your.report.file> PURGE
    IF MAXCC = 8 THEN SET MAXCC = 0 
/* 
//* 
//HIST     EXEC PGM=ICEMAN 
//SYSPRINT DD SYSOUT=K,HOLD=YES 
//SYSOUT   DD SYSOUT=K,HOLD=YES 
//SYMNOUT  DD SYSOUT=K,HOLD=YES 
//SYMNAMES DD * 
* INPUT RECORD 
POSITION,1 
RDW,*,4,BI 
SKIP,1 
RTY,*,1,BI 
TME,*,4,BI 
SKIP,4 
SID,*,4,CH 
WID,*,4,CH 
STP,*,2,BI 
* 
* AFTER INREC 
POSITION,1 
_RDW,*,4,BI 
_HOUR,*,2,CH 
_MIN,*,2,CH 
_SEC,*,2,CH 
_T30,*,4,BI 
//SYSIN   DD * 
  OPTION VLSCMP 
* KEEP ONLY SMF TYPE 30 SUBTYPE 2 (INTERVAL) RECORDS
  INCLUDE COND=(RTY,EQ,+30,AND,STP,EQ,+2) 
* TM1 FORMATS THE SMF TIME AS C'HHMMSS'; ADD A 4-BYTE COUNT OF 1
  INREC FIELDS=(RDW,TME,TM1,X'00000001') 
* SORT BY MINUTE AND ADD UP THE COUNTS
  SORT FIELDS=(_MIN,A) 
  SUM FIELDS=(_T30,BI) 
* WRITE THE REPORT, IDENTICALLY, TO THE DATA SET AND TO SPOOL
  OUTFIL FNAMES=(SORTOUT,TESTOUT), 
  OUTREC=(_RDW,X,_MIN,X,_T30,EDIT=(IIIIIIIIIT)), 
  HEADER1=('MIN',X,'   RECORDS',/,'---',X,'----------')
//TESTOUT   DD SYSOUT=K,HOLD=YES 
//SORTIN    DD  DISP=SHR, 
//             DSN=<your.input.data.set>
//SORTOUT   DD DISP=(NEW,CATLG),UNIT=PM,RECFM=VBA, 
//            SPACE=(CYL,(1,1),RLSE),DATACLAS=FASTBAT, 
//        DSN=<your.report.file> 

Of course you’ll need to fiddle with things like the DATACLAS parameter and which output class you use. I’m assuming that’s within the skill set of anybody reading this post.
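
If you’d rather do the counting off-platform (or just sanity-check the DFSORT output), here’s a rough Python equivalent. It assumes the SMF data has been dumped and downloaded in binary with the RDWs intact, and that no records are spanned – a real implementation would have to reassemble spanned records.

# Rough off-platform equivalent of the DFSORT job: count SMF 30 Subtype 2
# records by the minute (within the hour) in which they were cut.
# Assumes a binary file of variable-length SMF records with RDWs intact.
from collections import Counter
import struct
import sys

counts = Counter()

with open(sys.argv[1], "rb") as f:
    while True:
        rdw = f.read(4)
        if len(rdw) < 4:
            break
        (length,) = struct.unpack(">H", rdw[:2])          # record length includes the RDW
        record = rdw + f.read(length - 4)
        if record[5] != 30:                               # SMFxRTY: record type
            continue
        (subtype,) = struct.unpack(">H", record[22:24])   # SMFxSTP: subtype
        if subtype != 2:
            continue
        (tme,) = struct.unpack(">I", record[6:10])        # time since midnight, 1/100 second
        counts[(tme // 6000) % 60] += 1                   # minute within the hour

for minute in sorted(counts):
    print(f"{minute:02d} {counts[minute]:>10}")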

A “Well Behaved” Case

The following graph is from a customer whose SMF 30–2 and RMF records all appear, regular as clockwork, every 15 minutes. I’m showing SMF 72 (Workload Activity) records stacked on SMF 30–2 – as these are of similar volumes. [2]

In reality a few 30–2 records are cut every minute but no RMF records are cut “off the beat”.

Drilling down to seconds, as the next graph shows, almost all the records are cut in the first second of the minute – both SMF 30–2 and 72. [3]

A Less Tidy Case

Contrast the above example with another customer, whose data is less well behaved. [4]

In this case the RMF records are all cut “on the beat”; It’s the SMF 30–2 records that aren’t.

In fact this is data from just one system out of many.[5]

In the previous case I surmise Interval Synchronisation was used (SYNCVAL and INTVAL parameters in SMFPRMxx); in this case my best guess is that it isn’t.

Looking at Reader Start Times (which are in each Interval record, just as they are in Subtypes 4 and 5 for Step- and Job-End recording), it appears the time when an SMF 30 is cut is determined by when the address space started.[6]

Let’s drill down a little, using field SMF30WID in the SMF Header; It gives the subsystem as SMF (and SMFPRMxx in particular) sees it:

Superficially it looks as if the SMF interval is 10 minutes; It isn’t. It’s actually 30 minutes. The high peaks are thirty minutes apart but there is something going on every 10 minutes, affecting STC and probably OMVS. It’s something I’ll want to discuss with the customer.

Conclusion

You might ask why I care about this sort of thing.

Partly it’s curiosity, sparked in this case by occasional glimpses that things aren’t as simple as they appear: If you really want to believe Interval records are tidily cut on interval boundaries that’s fine – but occasionally the fine (or not so fine) structure will rise up and bite you.

In my case the code I use occasionally produces bad graphs because it summarises records on 15, 30, or whatever minute intervals and records fall into the wrong interval. I’d like to at least be able to explain it.

But more generally, I have to pick a summarisation interval. Understanding how frequently and tidily Interval records are cut enables me to do that. I’m going to put code to do a basic form of this analysis into our process – right after we fetch the raw SMF from wherever you send it (ECUREP, probably). This will save no end of time – as rebuilds of our performance databases and reruns of reporting can be reduced.

And, if nothing else, it’s prompted me to re-read the SMFPRMxx section in z/OS Initialization And Tuning Reference.

Now that can’t be a bad thing. 🙂


  1. That’s what the OUTFIL FNAMES parameter achieves.  ↩

  2. In my Production code I actually break down by SMFID and by record type in the range 70 to 79.  ↩

  3. In a busier or bigger system it might take more than 1 second on an interval to cut all the records; I look forward to seeing if that’s the case.  ↩

  4. It’s not really a moral judgment, but I expect this sort of data to cause more problems.  ↩

  5. My actual Production code reports by system – in order to see the differences at a system level.  ↩

  6. Thanks to Dave Betten and some SMF 30 code of his it’s possible for me to see the Reader Start Time – which isn’t in the SMF Header and so can’t rigorously be processed by a simple DFSORT job.  ↩

What’s The Point Of WLM?

(Originally posted 2014-11-09.)

At UK GSE Annual Conference I presented on DB2 and Workload Manager. It occurred to me that one of the slides was a good basis for a blog post, posing the question “what’s the point of WLM?” And this was the slide, with me “for scale purposes”. 🙂

(Thanks to Karen Wilkins for the photograph.)

So let me try to give you a synopsis of my view, expanding on each of the points on the slide.

Allows Scaling Like DFSMS

Back in 1988 I was one of the IBM Systems Engineers (SE’s) who supported a major UK customer in beta’ing DFSMS.[1] So I remember well the improvements in Storage Management that DFSMS brought.

Most notably the growth in data – data sets and volumes – was predicted to become unmanageable with the old ways of doing things. DFSMS, being Policy-Driven, provided constructs that enabled large numbers of volumes and data sets to be managed quickly.

The word “policy” is key to the analogy; WLM is also policy-driven, providing the same kind of leverage.[2] For many customers it would be inconceivable to manage performance with Compatibility Mode – even if it were still supported; The people cost would be too high, with the complexity of modern environments.

Much Simpler Than ICS / IPS / OPT

I’ll confess to never having been entirely comfortable with ICS / IPS / OPT; Sure I understood the mechanics but it was too early in my career to gather much experience of how it actually operated in real customer environments.[3]

There are, of course, people who “grew up” 🙂 with Compat Mode (and probably watched it evolve) and for them I’m sure it makes perfect sense.

For the pedants, yes we still have OPT (IEAOPTxx) but it is much simpler now.[4] And what was got rid of I think we can happily live without.[5]

Can Manage Newer Stuff

It’s been so long now since WLM became the only game in town that I forget the myriad enhancements that assume it’s present. So I’ll take one example area: Server Address Spaces.

There are at least three functions that use some variant of the Server Address Space mechanism:

  • WLM-Managed Initiators.
  • WebSphere Application Server address spaces.
  • DB2 Stored Procedure server address spaces.[6]

All three of them rely on WLM to balance system conditions against goal attainment when deciding on whether to start additional address spaces. There was nothing like it in Compat Mode.

As I said, it’s one area and there are numerous others.

Can Manage Stuff “Properly”

I said WLM was policy-driven. From the outset the rhetoric was that you could couch the policy in business terms. For response time goals that’s obviously true. For velocity goals it’s a little less clear.

Certainly WLM Importance can be used to separate, with clarity, important work from less important work.

So I think WLM enables you to much more closely align performance specifications with business goals.

Conclusion

This has been a brief synopsis. Much more and “TL;DR”[7] definitely would apply. And because it’s brief it’s had to be selective.

And if you think it egotistical of me to post a photo of myself, consider I look different from my previous avatar; Clearly older, but I probably don’t look wiser. 🙂

One final thought: There’s an enormous amount of Performance Tuning that has nothing to do with WLM; It’s important to be realistic about that. And anyone who talks about WLM like it’s some panacea – and people do – needs to be reminded of it.


  1. If I tell you I was an IBM SE you are supposed to understand my mindset and “get the hint”. The hint that I’ve been around a while and done interesting things. 🙂  ↩

  2. Both DFSMS (through ISMF) and WLM are panel-driven, wherein you manage the policy. I already take the WLM ISPF TLIB and generate reporting from it. I wonder if the same approach would work with ISMF.  ↩

  3. And the instrumentation – mainly in RMF but also in SMF 30 – is so much better, which really helps.  ↩

  4. And, again for the pedants, what’s been added to OPT is new, rather than reversing the simplification.  ↩

  5. Anyone care to challenge that?  ↩

  6. I wrote the Server Address Space Management chapters of DB2 for z/OS Stored Procedures: Through the CALL and Beyond in 2003.  ↩

  7. Too Long; Didn’t Read  ↩

Not So Much Renaissance Man More Tool-Using Ape :-)

(Originally posted 2014-11-02.)

If you come to my blog only for Performance- or SMF-related topics you’re going to be disappointed in this post. But if, like me, you’re interested in storytelling and web-related technologies then read on.

This post is about HTML5 Canvas – a technology I really like.

Some Of Why I Care About Web Technologies

To try and keep this focused I’m going to talk only about why web technologies are relevant to my “day job”.[1]

The tooling I curate and use was built over many years by many people. Its graphics are built on GDDM, and look like they date from the 1970s. But I’m not so concerned about how they look, so long as they tell the story well. [2]

But there are some stories that require some new methods of depiction, some new diagram types. Perhaps the ones in WLM Velocity – Another Fine Rhetorical Device I’ve Gotten Myself Into are a poor example of that. I don’t think I’ve shown you machine diagrams yet, but plenty of customers this year have seen them. And they’re a much better example of stuff that would require some quite low level GDDM programming.[3]

So I adopted a new approach, one that already yields nicer graphics than (I think) I could do with GDDM.

Step Forward HTML5 Canvas

Web standards, and HTML5 in particular, are slowly evolving. One of the most stable pieces is the “new” <canvas> tag. And it’s the one I find most immediately useful.

With canvas you use javascript to create diagrams. [4]

While many of you probably don’t know javascript, a lot of people do, and it’s a fine, readily learnable language. It’s certainly fit for the purpose of manipulating character strings and driving diagram creation. [5]

Today I actually create the HTML and javascript using PHP – which is good for most things, especially parsing XML and HTML and string manipulation.

To use all this you need a modern web browser, of which more anon.

Note: You can build sophisticated 3D diagrams using WebGL. WebGL can use Canvas. But today I don’t use WebGL – though I have a book on it, so one day I might.

Insufficiently Clever By Half?[6]

HTML5 Canvas is supported by most “modern” web browsers. You could say any browser unable to support Canvas is not a modern browser. But the degree of support varies by browser, and between browser releases. My recent experience with Firefox Nightly shows it supports some drawing capabilities previous releases don’t – such as dashed lines. [7]

Support for drawing capabilities is one thing; Another is behaviour in the browser:

In Firefox right-clicking on a canvas element brings up a menu with a “View Image” item. This displays the graphic as a PNG. [8] This PNG can be copied or saved in a file.

Three snags:

  1. It would be better workflow if Firefox allowed you to Copy or Save the graphic without having to View Image first.

  2. When I last checked, neither Mobile Safari nor Chrome had the same workflow.

  3. Dragging the graphic into Symphony seems to cause the latter to loop. (And you can’t drag from the page with the canvas in.)

A glance at the spec suggests it doesn’t address how a browser should behave with the canvas element. I’m not saying it should but, and this is perhaps my conclusion, it would be really nice to see browsers competing with each other on how they handle canvas.

As it’s an open-source browser I’d quite like to fix it for Firefox, but I simply don’t have the time. 😦

But for now, it’s really satisfying to be able to generate diagrams this way that (to my eyes at least) look decent. And so far I have:

  • Machine diagrams
  • WLM depictions
  • Gantt charts – in colour [9] and with the scale in hours and minutes

And I’ll confess it’s been fun. 🙂


  1. There are plenty of other reasons for liking web technologies, of course.  ↩

  2. One day we might hire a graphics designer – but finding one who knows GDDM is going to be tough.  ↩

  3. Albeit in REXX, probably.  ↩

  4. There are plenty of HTML5 Canvas tutorials on the web; None strikes me as overwhelmingly better than the rest.  ↩

  5. Learn it anyway; As a useful programming language in its own right.  ↩

  6. When people say something is “too clever by half” I think they really mean it’s “insufficiently clever by half”.  ↩

  7. I actually use this in my machine diagram and have to use a kind of polyfill for when I’m running in an older version of Firefox.  ↩

  8. Using a Data URI.  ↩

  9. This is something I haven’t been able to do before – colour – and I’m only just beginning to think of uses for colour coding in Gantt charts.  ↩

The End Is Nigh For CICS

(Originally posted 2014-10-12.)

… and other address spaces, too. 🙂

In Once Upon A Restart I talked about how to detect IPLs and restarts of CICS regions and MQ subsystems (and other long-running address spaces) – from SMF Type 30 Interval records.

It’s easy to see starts but what about stops?[1]

It turns out you can estimate when address spaces stop from the SMF 30 Interval records (Subtypes 2 and 3):

  • When there is no longer a record for the address space (with a given Reader Start Time) the address space has terminated. So the last record for that job name with the given Reader Start Time marks when it came down.
  • When there is again a record with the same job name it will have a new Reader Start Time and the address space has come up again.[2]

This is actually a naive implementation but it gets me very close to when an address space comes down.
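
Here’s a minimal sketch of that logic. It assumes you’ve already reduced the SMF 30 Interval records to (job name, Reader Start Time, interval end time) tuples – the tuple layout is mine, not an SMF field list.

# Naive up/down detection from SMF 30 Interval records, as described above.
# A new Reader Start Time for a job name marks a new instance (a restart);
# the last interval end seen for a (job name, Reader Start Time) pair
# approximates when that instance came down.
from datetime import datetime

def instances(interval_records):
    """Yield one (job, up_time, approximate_down_time) per address space instance."""
    last_end = {}
    for job, reader_start, interval_end in interval_records:
        key = (job, reader_start)
        if key not in last_end or interval_end > last_end[key]:
            last_end[key] = interval_end
    for (job, reader_start), down in sorted(last_end.items()):
        yield job, reader_start, down

records = [
    ("CICSA", datetime(2014, 10, 1, 6, 0),   datetime(2014, 10, 1, 6, 15)),
    ("CICSA", datetime(2014, 10, 1, 6, 0),   datetime(2014, 10, 1, 20, 0)),
    ("CICSA", datetime(2014, 10, 1, 23, 30), datetime(2014, 10, 1, 23, 45)),
]

for job, up, down in instances(records):
    print(f"{job}: up at {up:%H:%M}, down around {down:%H:%M}")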

So What?

The flippant answer is that I extend what my tooling does because it pleases me to. 🙂

But actually that’s not true: To the extent that it needs a justification I’m more useful the closer I get to how my customers are running things, and to understanding their problems.

Specifically, in the handful of customers I’ve tested this code with, I have quite a good understanding of the relationship between CICS regions [3] and the batch. For example:

  • I see CICS regions come down and not come back up again for hours, sometimes on a timer pop and sometimes event-driven. This is usually overnight and I’m therefore seeing a Batch Window.
  • I see CICS regions come down and get immediately restarted – in a way that suggests being put into read-only mode or flipping data sets. [4] Again this can be a sign of a batch window.
  • I see test regions come up for very short periods of time and then go down again. [5]

Actually, being (supposedly) open minded, I don’t know quite what I’ll see. But these are the sorts of things I think I’ll see.

Here’s a depiction of CICS coming down for Batch and restarting after:

CICS Down For Batch

and here’s a conflation of a number of scenarios where CICS gets bounced but is still up alongside batch. In this case it’s in “Read Only” mode:

CICS Read Only

Again, So What?

The answer to why this might be relevant to you is:

  • Many of you are looking after a plethoration [6] of systems and applications. This technique might save you time.
  • If I start talking to you about up and down times this might help you understand where I got it from. The words “see my blog” escape from my lips quite frequently these days.

And I expect I’ll be updating Life And Times Of An Address Space with this.


  1. Yes you can use SMF 30 Subtypes 4 and 5 to get step- and job-end timings but I prefer not to make customers send me these. I might change my mind, one day.  ↩

  2. But I treat this as a new instance of the region / address space.  ↩

  3. It’s really only the CICS regions that get frequently restarted. But I’d notice if others did.  ↩

  4. In one customer case this is to pick up new versions of VSAM data sets the batch has created.  ↩

  5. I probably should pick up termination code to see if they ABENDed. Unfortunately there isn’t one as the Completion Section isn’t created for SMF 30 Subtypes 2 and 3 but only Subtypes 4 and 5.  ↩

  6. It probably should be “plethora” or “proliferation” but I like combining the two into “plethoration”. I hope you do, too. 🙂  ↩


Curiouser And Curiouser, Spike

(Originally posted 2014-09-28.)

As you’ve probably gathered I like to get nosy about how customers run systems. This is probably best and most recently exemplified by this blog post: Once Upon A Restart

So this post is about another piece of curiosity: What spikes can tell us about how people run systems. In a way it’s similar to what restarts tell us, hence the above blog post link.

I like “Think Fridays”. But I’ve been rather busy of late, so what I got to do this past Friday was brief and embryonic, showing just some of the potential of the method. In short it’s a prototype or an experiment. But, in line with the “Fink Thriday” 🙂 idea, it did get me thinking and exploring.

But such things don’t happen in a vacuum: I’ve noticed spikes in CPU and memory usage by address spaces before. Many times before. So that has gradually formed a question in my mind:

“Is there an event that triggers a spike in an address space?”

Now, I’m not really thinking of the sorts of anomalies that zAware might learn to detect. I’m thinking of the more mundane “such and such happens every Tuesday night at 8PM” kind of event.

My Prototype

For my experiment / prototype I took a pair of LPARs. Let’s call them PROD and DEVT – for those are the roles these LPARs have.

I took SMF 30 Interval (Subtypes 2 and 3) records for both systems and examined a number of address spaces I’d spotted spiking:

  • DFHSM – on both systems, in STCMD.
  • DFRMM – on both systems, in STCMD.
  • CATALOG – on both systems, in SYSTEM.
  • An address space related to data extraction and transmission – on PROD, in STCMD.

For each of these I wrote code to examine CPU usage:

  • It computes the Average CPU across the whole set of data for the address space.
  • It detects intervals where the address space uses at least 2x, 4x, 8x, 16x etc. the Average CPU.

Between these two I get spikes – whether broad or narrow, tall or short. Right now I just pump them out in a table – so lots of refinement can happen later on.
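
Here’s a hedged sketch of the same idea: compute the overall average and then flag each interval with the largest power-of-two multiple of that average it reaches. The input layout and sample numbers are assumptions of mine, not lifted from the real code.

# Prototype-style spike detection: flag intervals whose CPU is at least
# 2x, 4x, 8x, 16x (and so on) the address space's overall average.
from datetime import datetime

def spikes(samples, max_doublings=6):
    """samples: list of (timestamp, cpu_seconds) for one address space."""
    if not samples:
        return []
    average = sum(cpu for _, cpu in samples) / len(samples)
    flagged = []
    for when, cpu in samples:
        multiple = 0
        for n in range(1, max_doublings + 1):
            if cpu >= average * (2 ** n):
                multiple = 2 ** n
        if multiple:
            flagged.append((when, cpu, multiple))
    return flagged

samples = [
    (datetime(2014, 9, 26, 17, 0),  2.0),
    (datetime(2014, 9, 26, 17, 30), 45.0),   # the daily Space Management spike, say
    (datetime(2014, 9, 26, 18, 0),  1.5),
]

for when, cpu, multiple in spikes(samples):
    print(f"{when:%H:%M}: {cpu:.1f} CPU seconds, at least {multiple}x the average")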

DFHSM

In PROD there’s a daily narrow spike around 5:30PM. And it’s a very substantial spike, CPUwise. So this looks like daily Space Management or similar daily functions. And its timing is regular as clockwork.

Here’s one day’s view of the service class that contains the DFHSM address space, as well as two of the other spiky address spaces.

In DEVT there’s a daily narrow spike around 8PM, but it’s not very pronounced. But additionally there are lots of other, broader, episodes of well-above-average CPU consumption. The 8PM spike might well be Space Management or similar; It’s hard to tell. I expect Development LPARs will turn out to show this behaviour with DFHSM.

DFRMM

In PROD there are daily broad peaks – of around 45 minutes – just before the working day starts. But their incidence varies by as much as an hour and a half – quite probably governed by when the overnight Batch ends.

In DEVT there are narrower spikes at around the same time as PROD in the morning. But there are also narrow spikes around 8PM.

CATALOG

In PROD CATALOG has a number of spikes that line up with the previously-mentioned ones. As well as some in the evening Batch window.

But here the picture is less stark – largely because the general daytime level of CATALOG CPU drives up the overall average.

In DEVT CATALOG CPU usage varies enormously, with no obvious spikes and no clear pattern. That too is probably a feature of Development workloads.

So I won’t claim the “spike” treatment is such a success for CATALOG: You can see the spikes from the graph, but my prototype code doesn’t throw them and their timing into sharp relief. So maybe I just need to work on the code some more.

Data Extract / Transmission Address Space

This only runs in PROD. Every day this spikes for a brief while, regularly each morning around 2:30AM to 3AM.

This doesn’t appear to be on a “timer pop” so much as having prereqs, but I’m not 100% certain of this; That would be something to ask the customer.

SMF Interval Accuracy

Obviously, with interval records, the timing of an event can only be approximated. Most customers I know use 15 or 30 minute intervals, which is fine. And our code picks the midpoint of the interval as a timestamp.

So we’re not going to detect events this way to better than 7.5 to 15 minutes’ accuracy. But I think that’s enough.

Events, Dear Boy, Events

Now, having said I’m not really looking for anomalies a la zAware, there is already one case where I do see happenings of the undesirable kind.

In the test data I’m working with I see DUMPSRV (Dump Services) suddenly use more memory at two points in the day. After each of these events memory usage returns to a very low value. CPU doesn’t show the same spikiness.

From my restart code I can see that a CICS region restarts (unusually) right after the second spike. So, based purely on SMF 30 Interval records it’s a reasonable guess that the region ABENDed and dumped at the time of the second spike. Not conclusive, but a reasonable guess. And the relationship between certain spikes and restarts is worth exploring.

Other Address Spaces And Metrics

I made arbitrary choices of job name, based on this set of data. I could equally have roped in such things as Sterling Connect Direct.

And I could look at all sorts of spikes, such as in EXCP rate, Virtual Storage Allocation and zIIP Usage. To do that I might have to make the code more specialised; For example, with DUMPSRV only looking at memory usage (not CPU).

Timer Pops And Movable Feasts

Timing – as with restarts – is interesting to me:

  • If something kicks off bang on, say, 8PM every single day what does that mean?

    Perhaps this is conservatively timed and could be earlier and event-driven.

  • If something kicks off at the same time every day, but not on a timing boundary, what does that mean?

    It might mean the processes in front of it kick off regularly but take a few minutes to complete. For example: CICS comes down at 8PM exactly but a post-shutdown job always completes at 8:03, allowing the batch to always start at the same time.

  • If something moves around what does that mean?

    Perhaps the chain of events it depends on takes a variable amount of time to complete, which might be a problem. For example, backups kicked off when the batch completes.

Conclusion

So I’m not telling the customer what to do about these spikes; There probably is nothing for them to do. But I feel I’m getting closer to how they operate. And maybe I’m seeing some challenges, such as the variability of timing of things that happen just before the online day.

As this was a quick experiment there are obviously some rough edges, and there’s more it could do.

I think I’m edging towards a “Day In The Life” approach to systems, key address spaces and applications. It might include both spikes and restarts. And probably the general “double hump” etc. patterns in workload we usually see. Now that could be useful. The journey continues… 🙂

WLM Velocity – Another Fine Rhetorical Device I’ve Gotten Myself Into

(Originally posted 2014-09-21.)

Back in 2010 I wrote about a graph I’d developed for understanding how a Service Class Period’s velocity behaves. That post is here: WLM Velocity – “Rhetorical Devices Are Us”.

At the time I was concerned not to show up the customer by displaying the graph. I think that was the right decision. But in the presentation I mention here: Workload Manager And DB2 Presentation Abstract I do have an example. And indeed it’s a significant part of my “zIIP Capacity Planning” presentation (which you can get from System z Technical University, Budapest 12–16 May 2014, Slides).

I regard that graph as a nice rhetorical device[1] as it has led to many fine discussions with customers (and I’ve tweaked it a little in the process).

But this blog post is about a very new graph I’ve developed on the theme of Velocity. I hope it, too, will lead to lots of interesting discussions with customers.

But the reason for sharing it with you is that you might well want to do something similar.

The primary purpose of the graph is twofold:

  • To show the hierarchy of importances and velocities.
  • To show when too many WLM service class periods are too similar – both in terms of velocities and importances.

As I write this those two bullets look remarkably similar but they’re not.

The Importance Of Importance

Question: Given two service class periods with equal velocities, which will be served first?

Answer: The one with the higher importance.

It’s a fact that importance is the major distinction in that WLM will try to satisfy the goal of a more important service class period before attempting to satisfy the goal of a less important one.

But note that some goals are unattainable and some velocity calculations are dominated by things WLM can do little about.

So this addresses Bullet 1 – the hierarchy.

The Importance Of Being Earnest

Sorry for the gratuitous section heading but it sort of fits: If you have a goal that’s greatly overachieved it’s not protective. For example, if STCHI has a goal of 40% and it always achieves around 80% it’s not protective: If CPU becomes scarce, as just one scenario, the velocity delivered could easily drop down towards the goal 40%.

So set goals that are “in earnest” and protective, unless you really don’t care if service drops off.

Flight Level Separation

But importance isn’t the only, ahem, important 🙂 thing: The difference in goal velocities also matters – goals that are too close together aren’t really useful.

If possible keep velocities at least 10 apart, or try to merge the service class periods.[2]

So this addresses Bullet 2 – the separation.

And Now To The Graph Itself

The graph I’m describing in this post is also a nice rhetorical device.

The graph has along the x axis WLM importance. On the y axis is the goal velocity. Each marker is a unique combination of importance and velocity. Next to the marker is a list of service class periods defined with that importance and velocity.

At least that was the first implementation.

Then I refined it (and I’m still fiddling with it):

  • If the service class period consumes significant CPU I bold and enlarge the name. If it doesn’t I draw the name in italics. So there’s a distinction between significant service class periods and insignificant ones – for this shift and this system. “Significant” is a “movable feast” so I expect I’ll tweak this in the future.

  • If there are more than 3 service class periods with the same importance and velocity I don’t list them but the label becomes e.g. “7 SC Periods”. It’s significant if many service class periods share the same importance and velocity.

  • I use colour coding for the service class periods – instead of adding e.g. “.1” to denote first period to the label. (I’m having to get fancy to manage the label “real estate”.)
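
If you want to sketch the basic form of this chart for yourself (before any of the refinements above), here’s a minimal matplotlib version. The service class periods, importances and velocities are invented, and the axis orientation is a matter of taste.

# Minimal sketch of the importance versus goal velocity chart, with invented
# service class periods. The real version adds bolding, roll-ups and colour.
import matplotlib.pyplot as plt

# (service class period, importance, goal velocity) - made-up examples
periods = [
    ("SERVERS",  1, 70),
    ("STCMD",    2, 50),
    ("PRDBATHI", 2, 50),
    ("PRDBATMD", 3, 30),
]

# Group the labels for each unique (importance, velocity) point
points = {}
for name, importance, velocity in periods:
    points.setdefault((importance, velocity), []).append(name)

for (importance, velocity), names in points.items():
    plt.scatter(importance, velocity)
    plt.annotate(", ".join(names), (importance, velocity),
                 textcoords="offset points", xytext=(8, 0))

plt.xlabel("WLM Importance")
plt.ylabel("Goal Velocity")
plt.xticks([1, 2, 3, 4, 5])
plt.ylim(0, 100)
plt.title("Importance versus Goal Velocity (illustrative)")
plt.show()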

OK. So enough prose; Let’s see some pictures. 🙂

Here are some example graphs. I’ve scaled them down to fit into the page column so you’ll find them clearer if you open the picture in a new tab or window.

First a straightforward one.

velocityI

In this example there are four significant service classes: SERVERS, PRDBATHI, STCMD and PRDBATMD. Here SERVERS (assuming it has the right things in it [3]) is sensibly located to the right and above everything else. STCMD and PRDBATHI are together in the middle and PRDBATMD is down and to the right.

This looks like a sensible hierarchy and generally the velocity “flight levels” have good separation.[4]

You’ll also notice a couple of (in black) Period 2 data points. Period 1 for these service classes have response time goals.

Now a case where the flight levels are too close together:

velocityK

Importance 2 and, even more so, Importances 3 and 4 have lots of crowding – with velocity separations down to 2 in some cases. WLM will have a hard time working with this.

Finally a more extreme case:

velocityZ

Here we have several cases where 4 or even 7 service class periods share the same importance and velocity.

Limitations Of The Method

The most obvious limitation is that other goal types – SYSTEM, Response Time and Discretionary – can’t be plotted on the same graph. It would be possible to draw SYSTEM / SYSSTC to the left and Discretionary to the right but it doesn’t add anything.

I’m going to have to think about how to plot Response Time goals – on a separate graph. There isn’t an obvious y axis. By the way, in all three examples there are service classes where the first (or first few) periods have Response Time goals and subsequent ones have Velocity goals. This is often observed – and this graph won’t show these early Response Time periods.

Also this is fairly static – being a “shift” summary.

The real question is “what do I do about flight levels that are too close together?” The ones that are identical might be amenable to combination but you can’t really combine PRDBATHI and STCMD (as in the first example) – unless these service class names are misnomers.

So this is why I consider this graphing technique a “rhetorical device”: I really want customers to think about whether it makes sense to combine service classes. And part of the motivation for this is WLM works better when the work in a service class period is sizeable.

This is also a “single system” graph and the constraints of running in a Parallel Sysplex – where there is only one WLM policy in effect for all members – aren’t reflected here. Again, doing the thinking is the important thing.

One, perhaps subtle, issue is the fact RMF records CPU in the Service Class Period where the work ended. You can see this for BATCHMED in the second example:

  • Periods 1 and 2 have little CPU in them; The name is italicised.
  • Period 3 has CPU in it; The name is in bold. Clearly work accumulates service (which has to include CPU) as it progresses through the periods. But there isn’t a good way to back-calculate the CPU in each period.

Conclusion

So I hope this graph gives you some ideas. Certainly I’ll be using it in customer situations and it’s a very easy graph for me to produce[5]. It will, of course, evolve – in all likelihood. For example you can see cases where the labels are either cut off or overlap something else.


  1. When I use the term “rhetorical device” I mean the graph is useful but not to be taken too seriously: It should usefully contribute to the discussion, warts and all.  ↩

  2. This, as we shall see presently, is easier said than done.  ↩

  3. You can tell (mostly) what’s in a Service Class using SMF 30: Workload, Service Class and Report Class are fields in the record.  ↩

  4. The more I use the term “flight level” the more I like it.  ↩

  5. It’s actually written in PHP which generates javascript. This in turn draws on an HTML5 Canvas element. In most browsers you can readily save the javascript and indeed the drawing as a PNG file. Actually I think browsers have a slightly awkward handling of Canvas elements – but nevermind. (If I, to paraphrase the late great Tony Benn, “retire to spend more time doing real computing” 🙂 I fancy I might be working on this.)  ↩

DFSORT JOINKEYS Instrumentation – A Practical Example

(Originally posted 2014-09-08.)

Some technologies show up “in the field” very soon after they’re announced and shipped. Others take a little longer.

Back in 2009[1] I blogged about one technology – DFSORT JOINKEYS. For this post to make much sense you’ll probably want to read that post first. Here it is: DFSORT Does JOIN.

Dave Betten and I have – at last – a set of data from a customer where one of the major jobs does indeed use JOINKEYS. The purpose of this post is to show you what one of these looks like – from the point of view of SMF records.[2] I won’t claim this post highlights all the statistics available to you but I hope it gives you a flavour.

Though the job is repeated, this post will concentrate on one such run. As you’ll see from the graphic below it runs from 15:25 to 16:33. There are two steps:

  • A SORT invocation, running from 15:25 to 16:11.
  • A JOINKEYS invocation, running from 16:11 to 16:33.

SORT and JOIN Gantt

SORT Step

While the SORT step is the longer of the two, the purpose of this post isn’t to discuss how to speed up the job overall. But it’s a good “warm up”:

  • In this case we can see the Input phase (marked by the timestamps for OPEN and CLOSE of the SORTIN data set): 15:25 to 15:51.
  • We can equally see the Output phase: 15:51 to 16:11 (from the SORTOUT data set OPEN and CLOSE timestamps).
  • We can see 22 SORTWKnn data sets were OPENed and CLOSEd, spanning both input and output phases.[3]
  • We can see no Intermediate Merge phase – the Input and Output phases abutting each other.

From The SORT Step To The JOINKEYS Step

The SORTOUT data set from the SORT step feeds directly into the JOINKEYS step as the SORTJNF1 data set. Note it’s sorted twice – once in the SORT step and again in the JOINKEYS step – which seems rather a pity. It is read by a TSO user later, so maybe the two different sort orders are needed.

What I’ve just used is our Life Of A Data Set Technique (or LOADS for short). Below is the LOADS table for this SORTOUT data set.

SORTOUT LOADS
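
For the curious, here’s a hedged sketch of what building such a table involves: collect every OPEN/CLOSE of the data set and list them in time order. The event layout and data set name below are assumptions of mine; in practice the events are reduced from SMF 14 and 15 (and 62/64 for VSAM).

# A hedged sketch of a "Life Of A Data Set" (LOADS) style table: list, in time
# order, every OPEN/CLOSE of a data set with the job and DD involved.
# The event layout and names below are made up, not an SMF field list.
from datetime import datetime

def loads_table(events, dsname):
    rows = sorted((e for e in events if e["dsname"] == dsname),
                  key=lambda e: e["open_time"])
    print(f"Life of {dsname}:")
    for e in rows:
        print(f'  {e["open_time"]:%H:%M}-{e["close_time"]:%H:%M} '
              f'{e["jobname"]:<8} {e["ddname"]:<8} {e["intent"]}')

events = [
    {"dsname": "PROD.SORTOUT", "jobname": "SORTJOB", "ddname": "SORTOUT",
     "open_time": datetime(2014, 9, 1, 15, 51),
     "close_time": datetime(2014, 9, 1, 16, 11), "intent": "write"},
    {"dsname": "PROD.SORTOUT", "jobname": "SORTJOB", "ddname": "SORTJNF1",
     "open_time": datetime(2014, 9, 1, 16, 11),
     "close_time": datetime(2014, 9, 1, 16, 33), "intent": "read"},
]

loads_table(events, "PROD.SORTOUT")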

JOINKEYS Step

This is where – to me – it gets more interesting. In this case we’re joining two data sets – DDs SORTJNF1 and SORTJNF2.

  • As you just saw SORTJNF1 came from the previous SORT step.

  • SORTJNF2 is a relatively small data set.

Both data sets are sorted on the same key fields. We know this just because they each have Sort Work File data sets – 5 used in one case and 21 in the other.[4]

You might’ve spotted that everything I’ve said so far is based on SMF 14 and 15 (Non-VSAM CLOSE for Read and Update) records. Now let’s start to dig into the SMF 16 (DFSORT Invocation) records, restricting ourselves to the JOINKEYS step.

We have three SMF 16 records for this step:

  • JNF1 Sort

  • JNF2 Sort

  • Joining Copy

The two sorts are necessary because the programmer told DFSORT to sort both files so the key fields for the Join are in order. As I indicated in DFSORT Does JOIN there are ways of avoiding this if the sorts are unnecessary (and terminating if the sorts are proven necessary).

For a real tuning exercise you’d try to avoid unnecessary sorts.

The following is a schematic of how the three invocations work.

JOIN Flow

Let’s look at JNF2 first. The 5 Sort Work File data sets OPEN and CLOSE within the same minute (16:11) according to our Gantt chart. Indeed there are zero EXCPs to them. But the SORTJNF2 data set is held open until the end of the JOINKEYS step (16:33).

Note there’s no output data set from this sort.[5] We’ll come to what happens to the output data in a minute.

Turning to JNF1, the Sort Work File data sets stay open throughout the JOINKEYS step; There’s lots of I/O to them.

Again there’s no output data set from this sort.[5]

The third SMF 16 record relates to the Copy (with an exit) that does the actual join. It has no input data sets but it does have an output data set (DD OUTFILE1).[6]

So let’s turn to what SMF 16 tells us about records and how they flow:

  • JNF1 reads 179 million records from DD SORTJNF1 and passes them to a DFSORT E35 exit, writing none to disk. These records are fixed-length and each 300 bytes. The sort’s key length is 15 bytes.
  • JNF2 reads 5,000,006 records from DD SORTJNF2 and passes them to a DFSORT E35 exit, again writing none to disk. The sort key is again 15 bytes, which is curious as the LRECL appears to be 11 bytes; Some padding must occur – perhaps to match the keys from JNF1.
  • COPY inserts 179 million records, passing that many to OUTFIL.
  • OUTFIL reduces the 179 million records to 30 million; The SMF 16 record says OUTFIL INCLUDE/OMIT/SAVE and OUTFIL OUTREC was used, which begins to explain the reduction. But the LRECL remains 300 bytes; I suspect the JOIN is to decide which records to have OUTFIL throw away, before writing them to DD OUTFILE1, and the OUTREC is to remove the extra bytes from JNF1 used in the record selection.

One other point – from SMF 14 and 15 analysis: In this case I don’t see records for SYMNAMES or SYMNOUT DDs, so either DFSORT symbols aren’t being used or they are SYSIN or SPOOL data sets, respectively. To my mind SYMNAMES data sets are most valuable when they are permanent. I don’t expect SYMNOUT to have permanent value, beyond debugging.

Conclusion

There’s lots of extra detail in the SMF 14, 15, and 16 records of course. But I hope this has given you some idea of how to view the data when JOINKEYS is invoked.

And the reason it’s taken us a while to see JOINKEYS in a customer is quite straightforward: It’s not something you flip a switch to use; Rather you have to write code to use it.

And note that this post hasn’t given any real tuning advice: The previously-mentioned blog post does. And the actual customer situation is a little more complex than this (though the facts I’ve stated are all true).


  1. I would think most customers have the function installed by now, so hopefully if you like JOINKEYS it’s there for you to use.  ↩

  2. To replicate this sort of thing you need SMF 14 and 15 for non-VSAM data sets, 62 and 64 for VSAM, 16 with SMF=FULL for DFSORT, and 30 subtypes 4 and 5 for step- and job-end analysis.  ↩

  3. In preparation for writing this post I took a detour: This Gantt chart used to, rather unhelpfully, have 22 lines for these SORTWKnn data sets, each with the same start and stop times. I now feel I can use this chart in a real customer situation as rolling up the SORTWKnn data sets that indeed have matching timestamps makes it so much punchier.  ↩

  4. Curiously JNF1WK16 is never OPENed. Perhaps I should teach my code to detect “missing” Sort Work File data sets like this.  ↩

  5. Both the absence of output data sets from SMF 15 and the absence of Output Data Set sections in the DFSORT SMF 16 record confirm this.  ↩

  6. You only get Output Data Set sections in SMF 16 if SMF=FULL is in effect for them.  ↩

Workload Manager And DB2 Presentation Abstract

(Originally posted 2014-08-18.)

I’m pleased to be presenting three sessions at UK GSE Annual Conference, Tuesday 4th and Wednesday 5th November in Whittlebury Hall.

Two are on the zCMPA (Performance and Capacity or “UKCMG”) track:

  • Life and Times of an Address Space (Tuesday)
  • zIIP Capacity Planning (Wednesday)

I’ve written about these extensively. Obviously they’ve evolved a bit and I have specific reasons to believe my experience will have evolved further between now and then.

But there’s a new one, on the DB2 track:

  • Workload Manager and DB2 (Tuesday)

I can’t be crisp about how this presentation came about 🙂 but I’m pleased to be doing it.

So here’s the abstract:


DB2 people don’t know WLM. WLM people don’t know DB2.


A slightly β€œcartoon” view but with an element of truth.


The point of this presentation is to unite the two perspectives, to give better DB2 performance while ensuring WLM is properly set up.


Over the years a recurrent theme has been enabling conversations between z/OS and DB2 people (and I admit to being more in the former camp than the latter).


By the way, I know it’s been a long time since I last posted. I might’ve totally lost my audience, but somehow I don’t think so. 🙂

I had a lovely holiday in Australia and then got very busy with a number of customer situations (which, personally, is the way I like to be). And, frankly, I had nothing to say. So I didn’t say it. 🙂 But now, while I’ve a heavy caseload, I’m seeing things that make me go “hmmm?” 🙂

I’m also pleased to say that my good friend Dave Betten joined the team I’m in as our Batch expert on 1st August. I’m hoping to coax some “guest posts” out of him, particularly in the area of DFSORT Performance. It’s great to have him onboard! I should also say I’m not giving up Batch and Dave is going to work on the full range of engagements I’m involved in. Two heads, I hope, will be better than one. For completeness, I’m also pleased to have Dave Hauser continue as our DB2 Performance lead.