DB2 Virtual Storage Goes Mainstream

(Originally posted 2005-06-13.)

For a number of years now we’ve been talking about the importance of DB2 Virtual Storage management. The real credit goes to John Campbell of the DB2 lab who pioneered this stuff.

We usually see DB2 Virtual Storage as an issue in 2 ways:

  • A client presents with the problem and asks for our help in resolving it.
  • We’re looking at a client’s data for some other reason and ascertain they need to look at Virtual Storage as well.

So I usually give the following Rule of Thumb: “If your DB2 Virtual Storage usage (as seen by SMF 30) exceeds 1GB, or your buffer pools exceed 512MB, you probably need to pay attention to Virtual Storage.”
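(If you wanted to automate that check, a minimal sketch might look like the following. The function and field names are mine, purely for illustration – the real numbers would come from SMF 30 and from totting up your buffer pool sizes.)

```python
# A sketch of the Rule of Thumb above. The inputs are byte counts you'd
# derive from SMF 30 (DBM1 private storage) and your buffer pool sizes.

MB = 1024 ** 2
GB = 1024 ** 3

def needs_virtual_storage_review(dbm1_private_bytes: int,
                                 bufferpool_bytes: int) -> bool:
    """True if DBM1 virtual storage exceeds 1GB or buffer pools exceed 512MB."""
    return dbm1_private_bytes > 1 * GB or bufferpool_bytes > 512 * MB

# Example: 900MB of DBM1 private but 600MB of buffer pools - worth a look.
print(needs_virtual_storage_review(900 * MB, 600 * MB))  # True
```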

The main instrumentation for studying DB2 Virtual Storage is IFCID 225, produced by turning on Statistics Trace Class 6 and collecting SMF 102 records.

This record allows you to build a map of the DBM1 address space. (Actually it’s technically not a map but it does show the major subdivisions of memory.) We like to do a stacked bar through time (AKA “With Varying Load”). Then you can see the dynamics and work on reducing the major areas.
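(Here’s a minimal sketch of that stacked-bar idea in Python. The area names and numbers are invented for illustration – in practice each bar’s values would come from one IFCID 225 record.)

```python
# Sketch of the "stacked bar through time" view of DBM1 virtual storage.
# Invented data: in practice each interval's numbers come from IFCID 225.
import matplotlib.pyplot as plt

intervals = ["09:00", "09:30", "10:00", "10:30"]   # one bar per record
areas = {                                          # MB per interval
    "Buffer pools":       [400, 420, 480, 460],
    "EDM pool":           [150, 150, 160, 155],
    "Thread storage":     [200, 260, 310, 280],
    "Other DBM1 private": [100, 110, 120, 115],
}

bottom = [0.0] * len(intervals)
for name, values in areas.items():
    plt.bar(intervals, values, bottom=bottom, label=name)
    bottom = [b + v for b, v in zip(bottom, values)]

plt.ylabel("DBM1 virtual storage (MB)")
plt.title("DBM1 storage map with varying load")
plt.legend()
plt.show()
```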

Today’s news is that, with APAR PQ99658 for Version 8 and (I think) Version 7, IFCID 225 is produced whenever Statistics Trace Class 1 is turned on. Which is pretty much always.

This corroborates my view that every DB2 subsystem should have it turned on all the time. It’s light and no-one has ever complained to me about turning it on.

Another change with this APAR is to drop the default value of STATIME from 30 minutes to a (rather more useful) 5 minutes. But note this means you get more records – six times as many, in fact. Yippee or ouch, depending on your perspective. I’m in the yippee camp on this one.

Finally the APAR says a cache of IFCID 225 data is kept internally so a dump can show the changes over time. Again a great help if a subsystem runs out of storage.

Learn from my mistakes – Which Engines Does IRD Vary Offline?

(Originally posted 2005-06-13.)

I’d assumed (in my code) that when IRD decided to vary an LPAR’s logical engines offline it would take the top (highest-numbered) ones first.

In fact that isn’t the case. I have data from a US company where it’s engines 2 and 3 (out of the range 0-11) that tend to be offline more than all the others.

In principle even Logical CP 0 can be offline. (Indeed some of the test systems in Poughkeepsie have no Logical CP 0.) Now this was a problem for us because I used the “Online Time” (SMF70ONT) for Logical CP 0 to determine the RMF interval length in STCK format. I use that to determine how much of the interval any Logical CP is online (using its SMF70ONT). In theory I could be dividing by zero. In practice I haven’t seen that.
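(For the curious, here’s a sketch of a more defensive way to do that calculation – estimating the interval from the busiest logical CP rather than trusting CP 0 to be online. It assumes the SMF70ONT values have already been pulled out of the SMF 70 record.)

```python
# Defensive version of the online-time calculation described above.
# Assumes the SMF70ONT values (STCK format) are already extracted.

def stck_to_seconds(stck_value: int) -> float:
    # Bit 51 of a STCK value is 1 microsecond, so >> 12 gives microseconds.
    return (stck_value >> 12) / 1_000_000.0

def online_fractions(smf70ont: list[int]) -> list[float]:
    """Fraction of the interval each logical CP was online.

    Uses the busiest CP's online time as the interval estimate, which is
    exact whenever at least one CP was online for the whole interval.
    """
    interval = max(smf70ont, default=0)
    if interval == 0:                     # no CP online: avoid dividing by 0
        return [0.0] * len(smf70ont)
    return [ont / interval for ont in smf70ont]

# Example: a 15-minute interval with CPs 2 and 3 mostly offline.
ont = [t << 12 for t in (900_000_000, 900_000_000, 120_000_000,
                         80_000_000, 900_000_000, 900_000_000)]
print(stck_to_seconds(max(ont)))  # 900.0 seconds
print(online_fractions(ont))      # [1.0, 1.0, 0.13..., 0.08..., 1.0, 1.0]
```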

Now APAR OA05798 documents a change in the order in which Logical CPs are varied offline. It talks about starting with CP 1, not CP 0.

But it does show how the implementation has already evolved and how wrong basic assumptions can be. 😦

Now I feel a new PMMVS chart coming on. 🙂 I have several times now plotted how much of the time each logical CP is online. This is, at this stage, of pedagogical interest. But clients do ask me to explain how IRD works. It is counter-intuitive.

UKCMG Conference

(Originally posted 2005-05-25.)

This is mostly to say a big thank you to all the customers and other consultants at UKCMG (which has just ended). It’s great to catch up with you all. And also to spend time with Don Deese and Mike Moroz.

I genuinely believe in user groups and so I’d like to encourage everyone to attend next year. And, just to be balanced, to support GSE as well – I get to go to both conferences and count myself as lucky to do so.

DB2 Accounting Trace Parallel Task Rollup – A Problem Child?

(Originally posted 2005-05-09.)

Here’s another breakage of Accounting Trace that’s been fixed:

APAR PK03905 describes a problem with DB2 Parallel Task Rollup that causes the Group Buffer Pool Accounting data for GBP 0 to be missing. (Its title talks about IFCID 148 but in the detail IFCID 3 is roped in. IFCID 3 means “Accounting Trace base record”.)

Of interest to performance people in DB2 Stored Procedures environments

(Originally posted 2005-05-04.)

I just saw the following APAR close. PQ99525 fixes a couple of problems for “nested” DB2 access, i.e. stored procedures, triggers and user-defined functions (UDFs).

For anyone thinking “this is not my shop” I think you should consider that these functions are appearing in many modern applications and your Applications people probably won’t think to tell you when they start using them.

To quote from the APAR description:

  1. Class 1 accounting logic for UDFs, stored procedures and triggers could capture and record small fractions of class 2 in db2 time. This would result in QWACASC and QWACAJST having non-zero values when class 2 accounting was NOT active.
  2. UDFs and stored procedure require in db2 time to connect and disconnect UDF or stored procedure tasks to db2. This time was not being accounting for in class 2 in db2 times (QWACSPTT, QWACSPEB, QWACUDTT, QWACUDEB). Class 3 suspension time is clocked during this connect and disconnect processing and thus class 3 time could be significantly greater than class 2 time.
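(The two symptoms quoted above are easy enough to check for. Here’s a minimal sketch – the record is just a dict, with my own field names alongside the QWAC ones; real values would come from formatted IFCID 3 records.)

```python
# Sanity checks for the two symptoms in the APAR text. The dict layout
# and the class2_active / elapsed field names are mine, for illustration.

def check_accounting(rec: dict) -> list[str]:
    problems = []
    # Symptom 1: in-DB2 times (QWACASC, QWACAJST) non-zero even though
    # Class 2 accounting wasn't active.
    if not rec["class2_active"] and (rec["QWACASC"] > 0 or rec["QWACAJST"] > 0):
        problems.append("class 2 times non-zero with class 2 inactive")
    # Symptom 2: class 3 suspension time exceeding class 2 elapsed time,
    # as happens when connect/disconnect time isn't counted in class 2.
    if rec["class3_suspension"] > rec["class2_elapsed"]:
        problems.append("class 3 time greater than class 2 time")
    return problems

print(check_accounting({"class2_active": False, "QWACASC": 12, "QWACAJST": 0,
                        "class3_suspension": 5, "class2_elapsed": 3}))
```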

It’s hard enough keeping track of the nested time without problems like this. We described how to do this in the “Stored Procedures: Through The Call And Beyond” Red Book.

ESCON vs FICON and LPAR IDs

(Originally posted 2005-05-03.)

My thanks to Greg Dyck for pointing out the following on IBM-MAIN:

“The Store-CPU-Identifier now allows for a 8 bit partition identifier but the ESCON I/O protocols only allow for a 4 bit identifier. This is the reason that multiple channel subsystems must be implemented if you want more than 15 partitions on a CEC. FICON does not have this limitation.”

And to think ESCON’s only 15 years old. 🙂 I remember at the time having the EMIF support explained to me – including the LPAR number field. At the time 15 LPARs seemed an awfully large number, particularly as people were still grappling with how to configure machines with 2 or 3 LPARs for performance.

LLA and Measuring its use of VLF

(Originally posted 2005-04-28.)

I’m reminded by a question on MXG-L Listserver that many people don’t understand how LLA works – and in particular how to interpret the statistics in VLF’s SMF 41 Subtype 3 record.

You really have to understand the exploiter to make sense of the statistics. Here’s how it applies to LLA…

LLA (when it was Linklist Lookaside, prior to becoming Library Lookaside) could cache load library directories in its own address space. You get no statistics for this behaviour. 😦

Then in December 1988, when MVS/ESA 3.1.0e was released, LLA got renamed to “Library Lookaside”. The key difference was that you could (selectively) enable exploitation of VLF caching of modules. Compare this with “XA” LLA, which only cached directories. You enabled module caching by specifying in your COFVLFxx member a new class (CSVLLA) with an EMAJ of LLA and a MAXVIRT (which defaulted to 4096 pages, or 16MB). 16MB was large in 1988. Now it’s puny.
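(MAXVIRT is specified in 4KB pages – hence 4096 pages being 16MB. A trivial sketch of the conversion, handy when picking a value such as the 128MB starting point I mention below:)

```python
# MAXVIRT in COFVLFxx is in 4KB pages: 4096 pages = 16MB (the old default).

PAGE = 4096  # bytes per page

def maxvirt_pages(target_mb: int) -> int:
    return target_mb * 1024 * 1024 // PAGE

print(maxvirt_pages(16))   # 4096  - the 1988 default
print(maxvirt_pages(128))  # 32768 - e.g. CLASS NAME(CSVLLA) EMAJ(LLA) MAXVIRT(32768)
```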

Now here’s why the VLF statistics are deemed “meaningless” for (CSV)LLA: LLA only asks VLF for something if it knows it’s going to get it. So you always get something close to 100% hits with LLA’s exploitation of VLF – even though you’d ordinarily regard the “successful find” rate as a reasonable metric of benefit.

It’s much better in my opinion to look at the load library EXCPs or SSCHs. After all, those are what you’re trying to get rid of. I know this is old news but a few years ago I got into a dialogue with the LLA component owners. They suggested that installations start with a VLF specification of 128MB for LLA – and then work up from there.

So that’s what we do – when it’s deemed relevant to look at this sort of thing. (I wrote this particular piece of our code in 1994.)

Coupling Facility Capacity Planning

(Originally posted 2005-04-26.)

A friend of mine from the ITSO in Poughkeepsie is contemplating a Red Paper on Coupling Facility Capacity Planning. I think this is a splendid idea – and would be willing to contribute myself.

But I’m wondering what y’all think of the idea. And of what you’d like to see in such a Red Paper.

(Red Papers are less formal than Red Books – which we all know and love.)

Innsbruck z/OS and Storage Conference – Day 4

(Originally posted 2005-04-14.)

Session TSP11: Performance of the TotalStorage DS8000, Speaker: Lee La Frese

Of course a new disk controller from IBM is going to appear awesome. 🙂 DS8000 is no exception.

Currently 2-way and 4-way POWER5 processors. Statement of Direction for an 8-way machine. POWER5 can go up to a 64-way.

I/O access density has levelled out over the last five years (2000 to 2005), having historically decreased (over the 1980 to 2000 period).

There’s a new adaptive caching algorithm (called ARC). Very effective at low hit ratios (for Open). Likely to have less benefit at higher cache hit ratios (eg z/OS workloads).

A small number of customers in the room have FICON Express2. There is a paper from Poughkeepsie on this.

Channel Measurement Block granularity has recently decreased from 128us to 5us. Which actually has been known to cause an increase in response times reported by RMF. But this doesn’t affect Lee’s performance numbers which come from the controller itself and aren’t subject to mainframe processor considerations.

PPRC (Sync Remote Copy) performance has improved dramatically. The big reductions were in ESS. But a 0.38ms cost for DS8000 over 8km has been measured. This is not really showing what happens at distance: 303km showed as about 4ms and 0km as about 0.5ms. Lee noted that you could interpolate reasonably well, as the elongation with distance is pretty much linear. The above numbers are at low load. Load doesn’t alter the picture very much – until you hit the knee of the curve.
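(Given the two endpoints and linear elongation, the interpolation is trivial. A sketch, using Lee’s approximate numbers:)

```python
# Linear interpolation of PPRC cost with distance, using the two
# approximate measured points above: ~0.5ms at 0km and ~4ms at 303km.

def pprc_cost_ms(distance_km: float,
                 ms_at_0km: float = 0.5,
                 ms_at_303km: float = 4.0) -> float:
    slope = (ms_at_303km - ms_at_0km) / 303.0   # about 0.0116 ms per km
    return ms_at_0km + slope * distance_km

print(round(pprc_cost_ms(100), 2))  # ~1.66ms at 100km
```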

Session Z16: Sharing Resources in a Parallel Sysplex, Speaker: Joan Kelley

Actually this was rather more about CF performance than actual resources. But it was very useful nonetheless.

Joan talked about shared CFs. Her “classic” scenario is a “Test” CF, whose performance is not important, sharing with a “Prod” CF, whose performance is. Recall: each CF needs its own Receiver links – so you need to configure accordingly.

Delayed requests can even break Duplexing – as a consequence of timing out a request.

Dynamic Dispatch puts the CF to sleep for 20ms at low utilisations, less time for higher utilisations.

A good design point for this sharing case is to turn Dynamic Dispatch on for “Test” and off for “Prod”. (D DYNDISP shows whether Dynamic Dispatch is on for a CF; you can also infer it from RMF.)

She had a good foil on responsiveness for different “Test”/”Prod” LPAR weight regimes. It showed that at low Test request rates the weights don’t matter much. At higher rates, though, weights of eg 5/95 produce much better responsiveness than eg 30/70. With System-Managed Duplexing you should set weights so that the Prod (duplexed) CF is dispatched 95% of the time – to avoid the timeouts I mentioned earlier.

With Dynamic ICF Expansion the CF can expand to additionally use a shared engine. If the shared engine is totally available – i.e. no other LPAR wants it – the performance is close to what an additional dedicated engine would give.

Because ICF engines, IFL engines and IFA/zAAP engines share the same pool it is possible for an IFL or IFA to acquire cycles at the expense of the ICF LPARs.

There were several foils on CF links. I’m going to have to read up on this as well: it’s got a lot more complicated. 🙂

Session ZP12: DB2 Memory Management in a 64-Bit World, Speaker: Me

This went reasonably well. I did get one question which was about what value to set MEMLIMIT at (which I think translates into what REGION value to have for the DBM1 address space). At present the answer has to be “I don’t know.” 😦 That’s because I don’t know if Extended High Private users can expand down into memory below the REGION value. If that makes any kind of sense. 🙂 I clearly need to research how REGION interacts with Extended High Private (which is what DB2 uses – mainly).

Session Z20: What’s New in DFSORT?, Speaker: Me

A small but reasonably engaged audience. One good question was essentially “if I set my DFSORT parameters in particular ways years ago, which do I need to change now?” Basically it would take a review of the parameters, but most of them wouldn’t need to change. But some of the newer ones are worth looking at to see how helpful they could be.