(Originally posted 2011–04–25.)
I concluded Batch Architecture – Part One with a brief mention of inter-relationships and data. I’d like to expand on that in this part.
Often the inter-relationships between applications are data driven – which is why I’m linking the two in this post (and in my thinking). But let’s think about the inter-relationships that matter. There are four levels:
- Between applications.
- Between jobs.
- Between steps in a job.
- Between phases in a step.
The first three are well understood, I think. The fourth is something I explored last year. Before I talk about it let me talk about “LOADS” – which I mentioned in Memories of Hiperbatch.
(And a minor note on terminology: Yes I KNOW that OPEN and CLOSE are macros. I don’t intend to use the capitalisation here – because the act of opening and closing a data set is meaningful, too (and less grating to read). Forgive me if this “sloppiness” offends.)
Life Of A Data Set (LOADS)
I won’t claim to have invented this technique. (As I said in “Memories of Hiperbatch” I declined an offer to write up a patent application because I knew I hadn’t originated it.) But I do advocate its use quite a bit. Here’s an (oft-used) example:
If you have a single-step job that writes a sequential data set and another that reads it (both from start to finish) there’s a characteristic data set “signature”: Two opens, one after the other, the first for update and the second for read. If you discern this pattern you might think “BatchPipes/MVS”. (Depending on other factors you might think other things – such as VIO.)
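To make the signature concrete, here is a minimal Python sketch of spotting it. It is not the code described in this post: the `Access` record, its field names and the `pipe_candidates` function are all hypothetical, and it assumes you have already boiled the open and close records down to one row per open, with a mode of "W" (update) or "R" (read).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Access:
    """One open/close of a data set by a job (hypothetical layout)."""
    dsname: str
    job: str
    mode: str          # "W" = opened for update, "R" = opened for read
    opened: datetime
    closed: datetime


def pipe_candidates(accesses):
    """Return data set names whose whole life is one writer, then one reader."""
    by_dsn = {}
    for a in accesses:
        by_dsn.setdefault(a.dsname, []).append(a)

    hits = []
    for dsn, life in by_dsn.items():
        life = sorted(life, key=lambda a: a.opened)
        if len(life) != 2:
            continue  # more (or fewer) opens than the simple signature
        first, second = life
        if (
            (first.mode, second.mode) == ("W", "R")
            and first.job != second.job
            and first.closed <= second.opened  # strictly one after the other
        ):
            hits.append(dsn)
    return sorted(hits)


if __name__ == "__main__":
    day = datetime(2011, 4, 25)
    sample = [
        Access("PROD.EXTRACT", "JOBA", "W", day.replace(hour=1), day.replace(hour=2)),
        Access("PROD.EXTRACT", "JOBB", "R", day.replace(hour=3), day.replace(hour=4)),
    ]
    print(pipe_candidates(sample))
```

A hit is only a candidate: whether piping (or VIO) is appropriate depends on the other factors mentioned above.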
So this is a powerful technique.
LOADS Of Dependencies
In 1993 we wrote code to list the life of each data set a job opened and closed. Not long after that we got tired of figuring out dependencies by hand from LOADS. So we fixed it:
At its simplest, a writer followed by another writer indicates a “WW” dependency, and a writer followed by a reader indicates a “WR” dependency. And so on.
Pragmatically some of these dependencies aren’t real, or at least it isn’t as simple as this sounds. For example:
- This says nothing about PDS members.
- GDGs are a little different.
- A writer one morning and a reader the same evening might not be marked as a dependency in the batch scheduler (though it probably ought to be). To at least alert the analyst (mainly me these days) to this sort of thing the code pumps out the time lag between the upstream close and downstream opens. (This is an enhancement I made, together with some more “eyecatcher” things with timestamps last year.)
- What’s the key here? Do we include volser?
But you can see there’s lots of merit in the technique, even with these wrinkles.
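As an illustration only, the derivation of “WW” and “WR” dependencies, together with the close-to-open time lag, might be sketched in Python like this. The record layout and the `dependencies` function are hypothetical, not the actual code. The sketch assumes the key is the data set name alone (no volser), ignores PDS members and GDGs, and only tracks the most recent writer, so it does not attempt the “and so on” cases.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class Access(NamedTuple):
    dsname: str
    job: str
    mode: str          # "W" = writer, "R" = reader (hypothetical coding)
    opened: datetime
    closed: datetime


class Dependency(NamedTuple):
    dsname: str
    upstream: str
    downstream: str
    kind: str          # "WW" or "WR"
    lag_seconds: float  # upstream close to downstream open


def dependencies(accesses):
    """Pair each open with the most recent earlier writer of the same data set."""
    by_dsn = defaultdict(list)
    for a in accesses:
        by_dsn[a.dsname].append(a)   # assumption: key is dsname only

    deps = []
    for dsn, life in by_dsn.items():
        last_writer = None
        for a in sorted(life, key=lambda a: a.opened):
            if last_writer is not None and a.job != last_writer.job:
                lag = (a.opened - last_writer.closed).total_seconds()
                deps.append(
                    Dependency(dsn, last_writer.job, a.job, "W" + a.mode, lag)
                )
            if a.mode == "W":
                last_writer = a
    return deps
```

A large `lag_seconds` (a morning writer and an evening reader, say) is the kind of thing worth checking against the scheduler's definitions.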
As I said before application-level, job-level (in some ways the same thing) and step-level dependencies are things we’ve known about for a long time. Also we’ve known about DFSORT (and other sort) phases for a long time, too: Input, Intermediate-Merge and Output phases. These should be familiar, although people tend to forget about the possibility of an intermediate merge phase – because it should only apply to large sorts.
So, if sorts have phases, what about other steps? Last year I enhanced the code to create Gantt charts for data set opens and closes within a step. In many cases jobs became no more interesting because of it. But in a number of cases fine structure appeared: Non-sort steps demonstrably had phases. In one example a step that read a pair of data sets in parallel wrote to a succession of output data sets. I could see this from the open and close timestamps of the output data sets. (Without looking at the source code I couldn’t be sure but maybe there’s some mileage in dissecting this step.)
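One simple way to look for this kind of fine structure, short of drawing the Gantt chart, is to group a step's output data sets by whether their open-to-close intervals overlap. The sketch below is an assumption about how that could be done, not a description of my code: `phases` is a hypothetical name, and the times are plain numbers (seconds from step start, say).

```python
def phases(intervals):
    """Group (name, opened, closed) intervals into runs of overlapping opens.

    A new phase starts whenever a data set is opened at or after the point
    where every data set in the current phase has been closed.
    """
    result = []
    current, current_end = [], None
    for name, opened, closed in sorted(intervals, key=lambda i: i[1]):
        if current and opened >= current_end:
            result.append(current)
            current, current_end = [], None
        current.append(name)
        current_end = closed if current_end is None else max(current_end, closed)
    if current:
        result.append(current)
    return result


if __name__ == "__main__":
    # Hypothetical step: three output data sets written one after another.
    outputs = [("OUT1", 0, 100), ("OUT2", 100, 250), ("OUT3", 260, 400)]
    print(phases(outputs))
```

Feed it only the output data sets: inputs that stay open for the whole step would otherwise merge everything into a single phase.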
It’s in my code: If it applies to your jobs I’ll be sure to tell you about it.
An Application And Its Data
Apart from the small matter of scale, figuring out which data an application uses is the same problem as figuring out which data a job uses.
I think I’ll talk about DB2 in a later post, as this one has already become lengthy.
As you probably know there is lots of instrumentation on data sets in SMF. Without going into a lot of repetitive description:
- You can get information about disk data sets from SMF 42 Subtype 6.
- You can get information about VSAM data sets from SMF 62 (open) and SMF 64 (close).
- For non-VSAM it’s SMF 14 (for read) and 15 (for update).
There are a number of lines of enquiry you might like to pursue, including:
- Working out which data sets contribute most to the application’s processing time. Here you’d use SMF 42 and something like I/O count or (more usefully) I/O count times response time.
- Figuring out which data sets are strongly related to this application and no other. In this case SMF 14, 15, 62 and 64 are needed. (You don’t need both 62 and 64 for the same data set.)
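Both lines of enquiry come down to simple aggregation once the SMF records have been reduced to rows. Here is a hedged Python sketch: the tuple layouts and function names are hypothetical (they are not SMF field names), and it assumes something upstream has already extracted an I/O count and average response time per data set, and an application name for each open.

```python
from collections import defaultdict


def rank_by_io_time(records):
    """Rank data sets by I/O count times response time, largest first.

    records: iterable of (dsname, io_count, avg_response_ms) rows,
    assumed to have been derived from SMF 42 data.
    """
    totals = defaultdict(float)
    for dsname, io_count, avg_response_ms in records:
        totals[dsname] += io_count * avg_response_ms
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def exclusive_data_sets(opens, application):
    """Return data sets opened by this application and by no other.

    opens: iterable of (application, dsname) pairs, assumed to have been
    derived from SMF 14, 15, 62 or 64 records.
    """
    users = defaultdict(set)
    for app, dsname in opens:
        users[dsname].add(app)
    return sorted(d for d, apps in users.items() if apps == {application})
```

The first function shows why the product is more useful than the raw count: a data set with few but slow I/Os can outrank one with many fast I/Os.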
None of the above applies to DB2: You don’t get 14, 15, 62 or 64 for DB2 data (despite DB2 using Linear Data Sets, a form of VSAM). But there is useful work you can do on DB2 data classification. And that is the subject of the next post in this series.