(Originally posted 2012-09-18.)
I’ve written extensively in the past about what you can glean about batch suites from SMF, most notably SMF Type 30.
While I don’t believe SMF alone can give you the full dependency network (complete with validation), I’ve just added some analysis to my code that gets me a little closer. As you’re probably never going to run my code, the interesting bit is the kind of inferences it’s now drawing. You might want to duplicate and refine them.
I have a standard report for a suite (a group of jobs with a naming convention) that lists their start and end times (amongst other things). It was written in the mid-1990s and did just fine for a while. In fact it never broke, it just got underwhelming. 🙂
Looking back at the change log 🙂 I see I made a major enhancement 2 years ago: The code uses Job Identifiers and Reader Start Times to attempt to find jobs released together. It doesn’t work as well as I’d like: it can take more than a second or two for a job to get released, so many false negatives occur. But it still reveals (true) stuff.
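As a rough illustration of that idea (not the actual reporting code), here is a minimal Python sketch that groups jobs whose reader start times lie close together. The `(job_id, reader_start)` tuples, the function name and the 2-second window are all assumptions made for the example. It uses the job identifier only as a label, so it says nothing about how job identifiers might contribute to the real test.

```python
from datetime import datetime, timedelta


def group_released_together(jobs, window=timedelta(seconds=2)):
    """Group jobs whose reader start times are within `window` of the
    previous job's reader start time (hypothetical helper).

    jobs: iterable of (job_id, reader_start) tuples.
    Returns a list of groups, each a list of job_ids.
    """
    ordered = sorted(jobs, key=lambda j: j[1])
    groups = []
    last_time = None
    for job_id, reader_start in ordered:
        if last_time is not None and reader_start - last_time <= window:
            # Close enough to the previous job: same release group.
            groups[-1].append(job_id)
        else:
            # Too far apart (or first job): start a new group.
            groups.append([job_id])
        last_time = reader_start
    return groups


# Made-up job identifiers and reader start times.
day = "2012-09-18 "
fmt = "%Y-%m-%d %H:%M:%S"
sample = [
    ("JOB00101", datetime.strptime(day + "01:00:00", fmt)),
    ("JOB00102", datetime.strptime(day + "01:00:01", fmt)),
    ("JOB00103", datetime.strptime(day + "01:00:04", fmt)),
    ("JOB00104", datetime.strptime(day + "02:00:00", fmt)),
]
for group in group_released_together(sample):
    print(group)
```

Note that each job is compared with the one immediately before it, so a long run of closely spaced jobs chains into one group. A job released 3 seconds after its neighbour lands in a group of its own, which is exactly the false-negative problem described above.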
The latest enhancement attempts to glean something from both start and end times for a stream of jobs:
I’m trying to figure out whether one job follows directly on from the previous one, whether it kicked off in parallel with it, or what.
So here are some tests the code performs, stopping after it satisfies any test:
- If a job starts no more than 5 seconds after the one above AND it starts before the one above finishes then I consider it starting together with it.
- If a job starts after the one above finishes but no more than 5 seconds after it finishes I consider it a follow-on job.
- If it starts within the same minute that the one above starts I print an = and not the start time (as that would be tedious and unmnemonic). It might be a co-starter, but it might not.
- Otherwise I print the start time as there’s some sort of a gap.[1]
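As a minimal sketch (not the actual reporting code), the four tests above might look like this in Python. The `Job` class, the function names, the job names and the labels `together` and `follow-on` are illustrative assumptions, as are sorting the suite by start time and treating a job that starts at exactly the moment the previous one ends as a follow-on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

TOLERANCE = timedelta(seconds=5)  # the 5-second threshold used in the tests
ZERO = timedelta(0)


@dataclass
class Job:
    name: str
    start: datetime
    end: datetime


def annotate(prev: Optional[Job], job: Job) -> str:
    """Apply the tests in order, stopping at the first one satisfied."""
    stamp = job.start.strftime("%H:%M:%S")
    if prev is None:
        return stamp  # first job in the suite: nothing to compare with
    # Test 1: starts within 5s of the previous start, before it finishes.
    if ZERO <= job.start - prev.start <= TOLERANCE and job.start < prev.end:
        return "together"
    # Test 2: starts within 5s after the previous job finishes.
    if ZERO <= job.start - prev.end <= TOLERANCE:
        return "follow-on"
    # Test 3: starts in the same minute as the previous job started.
    if job.start.replace(second=0, microsecond=0) == prev.start.replace(
        second=0, microsecond=0
    ):
        return "="
    # Test 4: some sort of gap, so show the start time.
    return stamp


def annotate_suite(jobs: List[Job]) -> List[Tuple[str, str]]:
    """Annotate each job relative to the one listed above it."""
    ordered = sorted(jobs, key=lambda j: j.start)
    result = []
    prev = None
    for job in ordered:
        result.append((job.name, annotate(prev, job)))
        prev = job
    return result


def t(hms: str) -> datetime:
    return datetime.strptime("2012-09-18 " + hms, "%Y-%m-%d %H:%M:%S")


# Made-up suite for illustration.
suite = [
    Job("JOBA", t("01:00:00"), t("01:10:00")),
    Job("JOBB", t("01:00:03"), t("01:05:00")),
    Job("JOBC", t("01:05:02"), t("01:20:00")),
    Job("JOBD", t("01:05:40"), t("01:06:00")),
    Job("JOBE", t("01:30:00"), t("01:40:00")),
]
for name, note in annotate_suite(suite):
    print(name, note)
```

In this made-up suite JOBB is flagged as starting together with JOBA, JOBC as a follow-on to JOBB, JOBD gets an `=`, and JOBE shows its start time because of the gap.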
If you were to ask “is this rigorous?” I’d have to say “no”.
If you were to ask “is it helpful?” I’d say “yes”:
“In tests” 🙂 it’s illuminated what would otherwise be an impenetrable set of start and end times. In other words it’s helping me edge towards a better understanding of a group of jobs.
Now, suppose my code hinted at a dependency – the “follow-on job” case. It wouldn’t explain the dependency.[2] There are things in SMF that might explain or corroborate a dependency: SMF 14 and 15 for non-VSAM data sets and SMF 62 and 64 for VSAM data sets can be used to understand some dependencies.[3] If I see a “follow-on” case I’d be inclined to look at the jobs involved and their access to data sets. I wouldn’t be nearly so rash as to suggest the lack of “data set” dependencies means the jobs could run together.
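To make the “look at their access to data sets” step concrete, here is a small hedged sketch in Python. It assumes the SMF records have already been reduced to time-ordered `(job, data_set, mode)` tuples, with mode `"W"` for a write and `"R"` for a read. That reduction, the function name, and the job and data set names are all assumptions for illustration, and no SMF parsing is shown.

```python
def data_set_links(accesses):
    """Find writer -> reader pairs from time-ordered data set accesses.

    accesses: iterable of (job, data_set, mode), mode "W" or "R",
    assumed to be in chronological order (hypothetical input format).
    Returns a set of (writer_job, reader_job, data_set) tuples.
    """
    last_writer = {}  # data set name -> most recent job that wrote it
    links = set()
    for job, data_set, mode in accesses:
        if mode == "W":
            last_writer[data_set] = job
        elif mode == "R":
            writer = last_writer.get(data_set)
            # A read of something another job wrote earlier hints at a link.
            if writer is not None and writer != job:
                links.add((writer, job, data_set))
    return links


# Made-up accesses for illustration.
sample_accesses = [
    ("JOBA", "PROD.DAILY.EXTRACT", "W"),
    ("JOBB", "PROD.DAILY.EXTRACT", "R"),
    ("JOBB", "PROD.DAILY.SORTED", "W"),
    ("JOBC", "PROD.DAILY.SORTED", "R"),
    ("JOBC", "PROD.REF.TABLE", "R"),
]
for link in sorted(data_set_links(sample_accesses)):
    print(link)
```

A link found this way corroborates a suspected follow-on. An empty result proves nothing, for the reason just given.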
In fact this illustrates something fundamental about the nature of batch: It’s very fragile and drawing the wrong conclusions is easy to do – with potentially disastrous consequences. Lots of people have started on batch analysis with the “how hard can this be?” attitude and ended up answering that with “it’s actually very hard”. It’s a steep learning curve but do get a helmet, a rope, crampons etc and start the climb. 🙂
With that climbing analogy I should say the point isn’t to reach the top (I don’t feel I have, for instance) but rather to make progress, lots of progress. In that vein, the algorithm I’ve explained above takes me quite a bit further. I hope you find it useful, and maybe you can provide fresh insight: My code continues to evolve as I spot patterns and things, which is exactly the way I like it.
1. By “gap” I don’t necessarily mean nothing ran but just that no jobs in the suite ran in that gap: In my test case I can see this suite waiting for other suites to get to a certain point.
2. And nor would the scheduler’s schedule: It just describes the dependencies as the installation saw fit to identify them.
3. DB2 data access is a notable case where SMF won’t tell you about the dependency.