Mainframe, Performance, Topics

Peggy Zagelow’s Blog

(Originally posted 2007-11-20.)

I’m very pleased to note Peggy also has a developerWorks blog. She’s one of the leading lights in DB2 Development.

Her blog is here

Memories of Batch LSR

(Originally posted 2007-11-18.)

Another trip down memory lane – prompted by the thread on IBM-MAIN of Batch LSR vs Hiperbatch.

Batch LSR started out as a prototype by Paul Dorn of the IBM Washington Systems Center. He was contributing to a book on writing a subsystem but also trying to solve a problem with Batch VSAM tuning. So Batch LSR started out as an example of a subsystem.

(I don’t recall a rash of subsystems written by users or vendors after that book was published. 😦 But BatchPipes/MVS would later be built as one.)

Now to the problem Batch LSR (hereafter referred to as BLSR was designed to solve, back in the 1980s…

If a batch job accessed a VSAM file it would almost always use VSAM NSR, which was (or rather could be) optimised for sequential access. If, however, the access was random^[1] NSR would be pretty hopeless. To use the random-oriented VSAM LSR would require the program to create its own VSAM LSR buffer pool(s) using the BLDVRP macro. Yes, that’s right, Assembler. 🙂 Most batch jobs were never going to be written that way.

So Paul wrote (and Poughkeepsie rewrote) the subsystem. Here’s what it does…

The subsystem creates the VSAM LSR buffer pools for you. And it allows you to specify a number of parameters, such as the location of the buffer pools (above or below the 24-bit line) and their size. It’s very easy to set up and easy to code the JCL changes needed to use it. (In a large number of places in our VSAM-based (SLR) analysis code we’ve done this.)

Here’s a case Roger Fowler and I worked on in the mid-1990s:

A UK customer had a batch job that did 2 million I/Os to an 850 CI data set. Roger said “let’s turn on BLSR”. They did and the job went down to 0.5 million I/Os. So I said “let’s turn on the Deferred Write option”. Roger said “the what?” 🙂 Anyhow we turned it on and the job went down to around 1,000 I/Os. From 2 million down to 1 thousand is pretty good, I’d say. 🙂

There was a tool called BLSRAID which was SAS-based. We didn’t have SAS so we wrote our own VSAM analysis code in 1993. Another piece in the jigsaw that is the PMIO toolset. (Roger and I used this report to make that saving.)

Another tool – for when you have enabled BLSR for a data set / job – is the User F61 GTF trace. This fed into a popular tool – VLBPAA – but you wouldn’t want to run the trace for all that long. In the back of the SG24-2557 “Parallel Sysplex Batch Performance” Red Book I wrote an appendix on playing with this trace. You can use this trace (and could use VLBPAA) to model the effects of big buffer pools.

In principle, as my and Roger’s example showed, you could fit the whole data set into memory. In fact we recommended BUFND=1024 to this customer. Slightly wasteful but a nice round number. (1024 4KB buffers is, of course, 4MB.)

There were many cases down the years where BLSR was a good fit. But quite often we made the recommendation to stick with NSR and tune for sequential access instead. The basic VSAM instrumentation – SMF 62 and 64 – would lead us to robust conclusions. My good friend Dave Betten (now Performance Lead for DFSORT) put together the Type 30, 62 and 64 records in one SLR table. And we regularly used the SMF 42-6 records (designed by Jeff Berger and eulogised by our good friend John Burg) to round out the VSAM data set picture.

Of course in 1997, with OS/390 Release 4 DFSMS, System-Managed Buffering (SMB) came along. This made it much easier to optimise for sequential or random access, using DFSMS constructs. I’m not sure how widely this is used – but it’s worth taking a look at.

Of course DB2, suitably managed, has many I/O strategies. But for VSAM you have to do it yourself. And the Batch LSR Subsystem made it possible for “random” VSAM I/O.

^[1] When I say “random” I really mean “direct”. I’m not sure I believe in randomness, particularly in data processing. The point is that there is “dotting about” rather than sequential access.

DYNDISP and DDF?

(Originally posted 2007-11-15.)

To whoever it is in Denmark that’s repeatedly hitting my blog today with a search for “dyndisp” and “ddf” I’d love to know what it is that is causing you to search for those two terms together. As you’re on my (day job) patch perhaps I should be talking to you directly.

So feel free to contact me by clicking on this link. (I’m hoping the “mailto:” URL form works on your system.)

And, yes, I am paying attention to the search traffic that comes to my blog. Big Brother? I hope you don’t think so.

z/OS R.9 RMF Parallel Sysplex New Fields

(Originally posted 2007-11-12.)

I’ve just re-read the XCF (74-2) and CF (74-4) sections of the z/OS Release 9 SMF Manual. There are some nice things in there…

XCF (74-2)

Number of buffer 1KB blocks by path (R742PUSE). This should tell you whether memory waste by over-specifying CLASSLEN is a problem. And some of the other scenarios.
Member job name (R742MJOB). I’d like to believe this would tell us which IRLM a given member is, for example.

Coupling Facility (74-4)

Whether this CF LPAR is using Dynamic Dispatch (R744FFLG bit 3)
How many shared processors (R744FPSN) and how many dedicated processors (R744FPDN) this CF LPAR has.
Structure Execution Time (R744SETM) which requires CFLEVEL 15. Useful for Coupling Facility capacity planning.
Whether this processor is dedicated (R744PTYP) and what its weight is if it’s shared (R744PWGT).

I haven’t actually looked at any sample R.9 data yet. I’ll have to put that to rights and get back to you when I see some of these numbers in action. But I do think they extend the data model rather nicely, particularly the CF CPU stuff.

Has anyone seen this data in action yet?

On Further Investigation…

I have found a set of 1.9 data. And here’s what I can immediately confirm…

R742MJOB is a very usable job name. Here’s an example…

From ERBSHOW…

#38:   +0000:  E2E8E2C2 40404040 C9E7C3D3 D6F0F0F4  *SYSB    IXCLO004
       +0010:  D4F8F040 40404040 40404040 40404040  *M80             
       +0020:  00030000 0000001A 00000016 00000000  *                
       +0030:  C7D9E240 40404040                    *GRS

Whereas before we had an anonymous member (M80) and an anonymous lock structure (IXCL0004) we now know that M80 is SYSB’s GRS address space and therefore that IXCL0004 is GRS Star. Other Lock Structure exploiters similarly fall into place. I think this will be more interesting for eg IRLM.

Memories of Hiperbatch

(Originally posted 2007-11-11.)

It’s nice to see a flurry of activity in IBM-MAIN about Hiperbatch. And it’s more for the emotional reason of reminiscence than for any stunning insights that I’m blogging about it…

Back in 1988 I ran a technical project in my then customer, Lloyds Bank, to evaluate Data In Memory (DIM). It was a fun project with a wide range of workloads on multiple machines, including the then-new DB2. So we tried out all the analysis tools and even did a bit of Roll Your Own…

Through the IBM internal FORUMs I met another IBM Systems Engineer, John O’Connell, doing a similar thing for his customer, Pratt and Whitney in Connecticut. And he wrote some SAS code to evaluate VIO to Expanded Storage. We shared this code with Lloyds Bank. (I ran into him at several conferences. And later I discovered he’d left IBM and, I think, gone to work for a customer. Are you out there, John?)

And I wrote a presentation on VIO to Expanded Storage (called VIOTOES – which sounds funny if you pronounce it right). 🙂 There were two key elements in this presentation:

How to evaluate the opportunities for VIO. And what happened if you fiddled with the VIOMAXSIZE tuning knob (the maximum (primary plus 15 secondary extents) temporary data set size that would end up as VIO (and hence potentially in Expanded Storage)).
How to use the then new-fangled DFSMS ACS routines to control which data sets were even considered for VIO – and all sorts of other funky tricks with DFSMS.

This presentation went down well with IBMers and customers alike and could be considered my first conference presentation.

The point of the above is to set the scene for Hiperbatch…

So, we announce Hiperbatch as part of MVS/ESA 3.1.3. (Funny how we went 3.1.0, 3.1.0e but not 3.1.1 or 3.1.2.) And it had a hardware prerequisite of a 3090 S processor (because of the MVPG instruction – even though technically one COULD move pages between Central and Expanded Storage using ordinary instructions if we’d chosen to implement it that way.) The important thing is that we wanted an exclusive by tying this super duper new facility to a brand new processor.

And because of this my fourth line manager at the time decided we all had to run Hiperbatch Aid (HBAID) studies. At this point I learnt I was not a “team player”. Well duh. 🙂 I declared I wasn’t going to do it because we’d already crawled all over Lloyds Bank looking for genuine DIM benefit. And there wasn’t likely to be any from Hiperbatch. With that defence the requirement to run HBAID was waived.

You’d think from that I’d a downer on Hiperbatch and HBAID, wouldn’t you? 🙂 Far from it actually…

I did enjoy running HBAID in one or two other customers and I did get quite creative with Hiperbatch. And that was the trick, in my opinion – getting creative. And that realisation led on to other things…

One of the really nice things about HBAID was that it Gantt’ed out data set lives. And from that you could glimpse where other techniques might be useful (such as VIO and OUTFIL). So I invented^[1] a technique called LOADS which stood for (for those of you averse to “flyovers”) 🙂 “Life Of A Data Set”. These signatures were really rather handy. Here’s a slightly later example…

A “standard” BatchPipes/MVS pipe candidate would be a sequential writer job followed by a sequential reader job for the same data set. There are, of course, lots of scenarios where Pipes is useful, each with their own signatures (LOADS).

And it was my good friend Ted Blank who told me about Pipes in late 1990. (And he had been involved in HBAID.)

Ted also encouraged me to start writing a book on Batch Performance. This later became SG24-2557 “Parallel Sysplex Batch Performance” and I believe you can still find it online. (Only this week I referred someone to it as a starter manual for what he wanted to do – but I DO regard it as being somewhat dated.) 😦

And the writing of the book got me into writing batch tools – which generalised what HBAID did and then some. And that’s how I came to be one of the developers of PMIO , the Batch Window analysis toolset / consulting offering. You may have heard of PMIO.

I’m aware of very few installations running Hiperbatch and even fewer running Pipes. 😦 But at least I got something out of it. And we did evolve the “state of the art” as far as batch window was concerned.

As for Hiperbatch’s applicability. I think it’s worth a look at. But there are so many other techniques around that more or less cover the same ground. Though some, like Pipes, cost money. And others are extremely creative to apply. But I don’t think it was ever going to take off in a big way. But that’s OK, given Hiperbatch was built to solve a problem one important customer.

^[1]Actually I don’t claim to have invented it, just popularised and generalised it. After all HBAID itself was doing much the same thing. In fact someone once suggested I should patent LOADS but I declined on the “not exactly oriiginal work” basis.

WLM-Managed DB2 Stored Procedure Address Spaces

(Originally posted 2007-11-09.)

I was contacted by the team updating the SG24-7083 “DB2 Stored Procedures: Through The Call And Beyond” this past week. Their question was quite straightforward:

“One of the statements in the book, in the chapter on WLM address space management states:

To help analyze the use of resources by different types of stored procedures, you should name the server address spaces in such a way that it is clear which Application Environment they serve. With this naming convention SMF Type 30 Subtypes 2 and 3, Accounting Interval records can be used to determine the resource consumption by each server address space. These records include CPU time and virtual storage usage.

Do you have any details on what this might mean?”

As I wrote that chapter I guess I do. 🙂

Here’s the gist of my reply. I think it’s worth sharing with you:

First a little background to what I was talking about…

You can observe the starting and stopping of WLM SP server address spaces – using SMF Type 30. Job end/step end subtypes 4 and 5. And also you can count the address spaces with a given name using the subtypes 2 and 3. This gives warm fuzzies or cold spikeys 🙂 about how the WLM-starting-and-stopping business is working out.

And, further, you can see the “weight” of the address space – in terms of (non-DB2) I/O, memory and CPU from the Type 30 records. Recall the “weight” feeds into WLM decisions about whether it can afford to start another address space that services the same queue.

Here’s why I wrote what I wrote.. Given the address spaces each serve one and only one Application Environment and one and only one WLM Service Class it would be VERY nice to be able to say things like “This AE / SC had some reluctance to start another address space because the EXISTING ones servicing this queue are too darned heavy”. To do that you need to identify the queue related to the address space your looking at. Hence a good address space naming convention is the ONLY thing that will enable you to do it.

There is obviously more detail one could add to this…

The program name a WLM DB2 Stored Procedures Server Address Space runs under is “DSN9XWLM”. (Other types of WLM Server Address Space will have different names but the same basic “trick” will still work.)

One could also worry about Virtual Storage above and below the 16MB line in such an address space. It’s a fact that some kinds of DB2 Stored Procedure are a bit of a challenge in virtual storage terms. Recall: You can write a DB2 Stored Procedure or UDF using almost any programming language or tools you’d ever want to.

If you had only one AE and one SC for work using DB2 Stored Procedures you’d not care about which address spaces were which. But the whole flippin’ point of Stored Procedures is that they enable common application logic across DDF, CICS, Batch, Websphere, etc.

And finally, a tiny piece of background as to why I focused on this issue in 2003 – when the original Redbook was written…

A major customer whom I greatly admire had had some difficulties with getting nested Stored Procedures (and UDFs) to perform. This was largely down to WLM not wanting to start new server address spaces. I believe the problems are well and truly behind them. But it was a very interesting technical area to get into. And, as usual, I dug into the instrumentation to see if we could shed some light on the matter.

So that’s how I came to write the chapter in the Redbook – given the 1.5 weeks IBM Global Services so graciously allowed me to spend on the project – on WLM DB2 Server Address Space Management.

And now some diligent people are working on a new version of the Redbook – and I’m pleased they’re asking me “what on earth did you mean by…?” 🙂

GSE Conference 2007

(Originally posted 2007-11-05.)

Last week I attended GSE Conference for the first time in a long while. And I’m very glad I did.

Let’s get the egotistical bit out of the way first…

I very much enjoyed presenting Memory Matters in 2008. If you’ve seen me present it before you might notice some minor tweakings. We learn as we go, don’t we. 🙂 As usual the challenge is “too much material”.

I attended a number of other presentations. For me the highlights were:

Fabio Massimo Ottaviani’s presentation on zIIPs and zAAPs. This gave me some new ideas for charts – though I think I have to get the hang of zIIP/zAAP normalisation factors before I can implement them.
Mike Duffy gave a good presentation on Sysplex Distributor.
John Campbell was great value, demonstrating (in a painfully practical way) 🙂 what it takes to deliver High Availability. “Don’t trust a machine you can lift” comes to mind. 🙂 His proper topic was “DB2 Availability”.
Scott Drummond took Bob Rogers’ (or is it Harry Yudenfriend’s?) 🙂 foils on “More Data” and ran with them. Actually I think I might steal the foils off all of them and add my own touches – probably to add in some DB2 perspectives. This would be a 3-way “crossover” presentation: z/OS / z9, DB2 and DFSMS / DS8000.

And it was great to run into many customer and vendor friends. Thanks to BMC for the pretty t-shirt 🙂 and to BluePhoenix for the USB hub. And it was nice to spend time with most of my unit and my manager Gerry. Almost a team meeting.

Anyhow enjoy the foils and (perhaps) tell me what you think of them. They will evolve.

Hackday4 and Referer URLs

(Originally posted 2007-10-27.)

I’m going to have to stop adding “(sic)” after every use of the word “Referer”. As in “Referer URL”.

The term itself comes from one of the standard HTTP headers. And therefore is somewhat fixed. Despite giving my “wetware spell checker” a pink fit on a frequent basis. 🙂

Anyhow, in this blog entry I talked about how I can – as standard – get a display of the URLs people come from to land on my blog. (And in a comment to the entry I said I’d put up (using a standard Roller macro) the list of such Referer URLs).

Well, yesterday was IBM’s fourth Hackday (known as Hackday4). The idea came from Yahoo – who’ve been doing a similar thing for a long time. IBM has had 4, spread over the last 18 months. I’ve participated in ALL of them. I have – pretty much permanently – tons of hacking ideas swirling around inside my head. THIS was not the first I had for this particular Hackday.

So, I took my firefox extension – unfortunately only likely to be available internally – and taught it a new trick…

When a developerWorks blogger is at their “Referer URLs” page it takes the list of URLs and analyses the ones that came from Google. (I may add Yahoo if I get a significant number of hits). If the hit from Google was a search I take 2 things from the URL:

The country (as part of the domain name, with the assumption that “www.google.com” is the USA).
The search terms used.

I do some counting and display the country list. Likewise the search terms list. The search terms list is a bit trickier as you can, for example get searches with “zos”, “z/OS”, “z/os” etc. TODAY I don’t take the slash out and assume they’re all the same thing. But I do assume “z/OS” is the same as “z/os”. And I do make some attempt to recognise when there’s an acronym…

If I see “icf” and then later on “ICF” the search term becomes “ICF” and the original (mixed case) version is discarded.

People sometimes put quotes in searches. Today I don’t handle that.

It’s been a fun “time-limited” hack. It would be nice to do more with it. And perhaps to find some other Roller-based blogging site that I could test it with. Then maybe I can ship an EXTERNAL firefox extension to the users of that.

And what does this all buy?

Basically, given that over 90% of my hits are direct (which probably mainly are “spiders”) it’s NOT that statistically significant. But it does tell me something about where in the world my readership is located. And also something about the things people are searching for when they stumble across my blog. So maybe what to write more about.

It’ll be interesting to see if any of the “webby” terms in THIS entry show up in the list.

More on Coupling Facility Async / Sync Thresholds

(Originally posted 2007-10-17.)

Following on from Coupling Facility Async / Sync Thresholds – They Are A’Changin’ I’ve been informed by Development there is a new improved write up on how the Dynamic Sync/Async conversion works in Chapter 6 of the z/OS Release 9: Setting Up A Sysplex manual. I’ve read it and it is VERY good.

One thing to pull out is that there isn’t just one threshold… There are different thresholds for Lock vs List and Cache structures, for Duplex versus Simplex (aka Non-Duplex), and by machine. The thresholds that have increased are not the ones for the Duplex case. Those stayed the same. (The thing that got me sight of the new documentation was asking what effect Duplexing had on the thresholds, by the way.)

Not coincidentally I was in a customer yesterday planning how we are going to measure a test with DB2 Data Sharing at a distance of over 20km. Part of that will be experimenting with turning off System-Managed Duplexing but not (we think) User-Managed Duplexing (aka GBP Duplexing). Without in any way betraying confidences I’ll see if I can write up the lessons we’ll have learned over the coming weeks and months.

So do take a look at the new description. It is really very good.

Coupling Facility Async / Sync Thresholds – They Are A’Changin’

(Originally posted 2007-10-13.)

APAR OA21635 is one of a rare breed: A change to the thresholds for z/OS’s automatic CF Request conversion. (I’m told this has only happened once before.)

If you recall z/OS Release 2 (in 2001) introduced a very nice function that automatically converts Coupling Facility requests from Sync to Async, based on thresholds. The purpose of this is to minimise the CPU cost for a coupled z/OS system, while still providing reasonable request response times:

With a synchronous request the requesting z/OS CPU (engine) actively spins, waiting for the request to complete. So a longer request service time would lead to a higher CPU cost for the coupled z/OS system.
With an asynchronous request the requester does not spin. But the response time is longer.

When I say response time I mean CF request response time, which may have little to do with actual application response times. But the two aren’t totally divorced.

(Requests that were originally Async don’t get converted to Sync, by the way.)

The big change came, as I say, in z/OS Release 2 where XES introduced a new algorithm, measuring response times and, based on thresholds, deciding whether to convert Sync requests to Async. This measurement is not done for every request, but rather once in a while. So we don’t have “nervous kitten” syndrome here. 🙂 I like algorithms that are responsive. I don’t like algorithms that are overly jumpy.

So, back to the thresholds:

Technology changes, so it’s appropriate to revisit the thresholds from time to time. APAR OA21635 is a result of this. To quote Development:

The synch/asynch thresholds have been recalibrated to better reflect current z/OS and processor technology. After installing the APAR, installations may see some asynchronous CF activity shift back to synchronous operations resulting in slightly improved performance for applications sensitive to these CF operations.

What you’ll see when you install this PTF varies, depending on your situation:

For requests using an IC link – obviously to the same-footprint – nothing’s going to change as all requests are going to be Sync unless the request was explicitly Async. Requests using an ICB link (to a nearby footprint are probably broadly similar.
For long-distance requests again nothing’s likely to change as the new thresholds would again suggest Async conversion as the norm.
For middle-distance (say less than 2km) links (ISC) there may be some change: More requests are likely to go Sync rather than Async. For some customers this category has been of concern. The shift in thresholds is meant to address it.

An interesting game I like to play is to compare request types to “local” and “remote” CFs, particularly with Duplexing. And, obviously, the response times seen. The simple case is without duplexing where “local” requesters get Sync requests in the main – at perhaps less than 50ms. And “remote” requesters get essentially Async requests, with a much higher response time, especially at distance.

Distance discussions are well informed by such analysis. As are Duplexing discussions. Which is, perhaps, the essential “take home” message of this post.

When implementing the threshold changes I would form a view of such things before and after applying the PTF. And I’d be interested to know how well it worked for you.

And remember there’s nothing essentially good or bad about Sync or Async. It all depends on whether the behaviour is appropriate for your scenario.

But I’m pleased that, once in a while, the thresholds are revisited. So I think this is a good APAR.