(Originally posted 2011-07-19.)
It may surprise you to know I hate asking questions to which I already know the answers. And I hate even more "leaving understanding on the table". Let me put it more positively: I love it when I can glean new insights into existing data. This post is about precisely that: An experiment in gleaning extra understanding…
In Batch Architecture, Part Zero and follow-on posts I talked about gleaning how an installation’s batch applications fit together. I’ll admit that part of it was a little sketchy, and I’ve had the opportunity since then to look at a number of customer batch environments. I really don’t much like the part where I ask the customer "what’s your batch naming convention?" So I wrote some experimental code and tested it with one of these recent sets of data…
My raw data in this is SMF 30 Job-End records, processed into a database in my usual way. (And you, too, could do the same – and everything else that’s in this post.)
Remember I’m looking for patterns in 8-character tokens, and about 100,000 of them. The latter may be an under- or an over-estimate for you. The former is fixed. (And this technique might work with other bounded-size tokens such as DB2 Accounting Trace Correlation IDs or CICS region names.)
Here’s the process my code follows:
- Discern some masks from a pass over the data. (More about this towards the end of the post – but it is the first step.)
- Apply these masks to all the jobs and see which masks fit. (I’ll tackle this first as it explains why we need the first step.)
Do These Jobs Match This Mask?
In this post a mask is a string of characters (for example "AAA999AA") against which each job name is tested. The "A" denotes "any alphabetic character in this position" and the "9" denotes "any numeric character in this position". So, in this example, a match would be a job name with the first three characters alphabetic, the next three numeric and the final two alphabetic.
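As an illustration only, here is a minimal Python sketch of that test. (My own code is REXX and isn't shown here.) The function name `matches_mask` is made up for this sketch, and it makes two assumptions the description above doesn't settle: "alphabetic" means the upper-case letters A to Z, and a job name whose length differs from the mask's simply doesn't match.

```python
import string

def matches_mask(job_name: str, mask: str) -> bool:
    """Return True if job_name fits mask ('A' = letter, '9' = digit)."""
    # Assumption: a name shorter or longer than the mask is a non-match.
    if len(job_name) != len(mask):
        return False
    for ch, m in zip(job_name, mask):
        if m == "A":
            # Assumption: "alphabetic" means upper-case A-Z only.
            if ch not in string.ascii_uppercase:
                return False
        elif m == "9":
            if ch not in string.digits:
                return False
        else:
            raise ValueError(f"unsupported mask character: {m!r}")
    return True

# Invented job names, purely for illustration.
print(matches_mask("PAY123AB", "AAA999AA"))  # three letters, three digits, two letters
print(matches_mask("PAY12XAB", "AAA999AA"))  # letter where a digit is expected
```

The first call fits the mask and the second doesn't, because its sixth character is a letter.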
(It’s perfectly reasonable to complicate things by allowing more than just "A" and "9". Perhaps "$" for non-alphanumeric and "*" or "?" as wildcards. I really don’t think that level of sophistication is necessary for this prototype – and Regular Expressions are probably overkill*.)
Because I knew the test data I started with the customer’s espoused naming convention: "AAA999AA" is indeed the mask for it. My code shows that 86% of all batch jobs match this naming convention. So what about the other 14%? 🙂 Maybe that’s a metric: percent_jobs_matching_espoused_naming_convention.
I could’ve stopped there but I thought it useful to analyse the three-character "AAA" piece of the mask: There were 35 different values. Sorting these by occurrence descending I see 11 with over 100 occurrences (the top one having 732). These could be suites (or applications, if you prefer). This I’d be happy to share with a customer. It would enable the conversation to start somewhere more useful than "what is your naming convention?"
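Both numbers (the percentage of jobs matching a mask, and the tally of three-character prefixes among the matching jobs) are cheap to compute. Here's a hedged Python sketch; `fits` and `summarise` are hypothetical names, the job names are invented, and it again assumes "alphabetic" means A to Z.

```python
from collections import Counter
import string

def fits(job_name, mask):
    """'A' = upper-case letter, '9' = digit; lengths must agree (assumption)."""
    if len(job_name) != len(mask):
        return False
    allowed = {"A": string.ascii_uppercase, "9": string.digits}
    return all(ch in allowed[m] for ch, m in zip(job_name, mask))

def summarise(job_names, mask="AAA999AA", prefix_len=3):
    """Return (percent of jobs matching mask, prefix counts sorted descending)."""
    matching = [name for name in job_names if fits(name, mask)]
    percent = 100.0 * len(matching) / len(job_names) if job_names else 0.0
    prefixes = Counter(name[:prefix_len] for name in matching)
    return percent, prefixes.most_common()

# Invented job names, not the customer's data.
jobs = ["PAY001AA", "PAY002AB", "GLX100ZZ", "BACKUP01"]
percent, prefixes = summarise(jobs)
print(f"{percent:.0f}% of jobs match")
for prefix, count in prefixes:
    print(prefix, count)
```

The prefixes with the highest counts are the candidates for suites.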
But, you’ll note, that’s one mask ("AAA999AA") that was already handed to me. Nice but not enough. I still think this "leaves understanding on the table".
How Do I Generate The Masks?
As I said, I think I can teach my code to do better than that. In fact I think I did…
With 8-character masks where each mask position can be in one of two states ("A" or "9") there are 256 potential masks (and that’s probably only 128 as I think the first position will have to be "A" – not that I’ve coded with that assumption). The point is there isn’t much potential for an explosion.
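To check the arithmetic, here's a small Python sketch that enumerates every possible mask. The function name `all_masks` and its `first_must_be_alpha` switch are inventions for this illustration, not part of my code.

```python
from itertools import product

def all_masks(length=8, first_must_be_alpha=False):
    """List every mask of the given length over the two states 'A' and '9'."""
    masks = ["".join(states) for states in product("A9", repeat=length)]
    if first_must_be_alpha:
        # Optional restriction: the first position has to be alphabetic.
        masks = [m for m in masks if m[0] == "A"]
    return masks

print(len(all_masks()))                          # 2**8 = 256
print(len(all_masks(first_must_be_alpha=True)))  # 2**7 = 128
```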
I glean the masks the following way. I run through the job names one character position at a time, looking at that position across all the job names:
- If the character in that position is a letter in, say, more than 90% of the job names I add "A" to any (partial) masks already generated.
- If it is a number more than 90% of the time I add "9" to any partial masks.
- If neither is the case I create two sets of masks – one with an "A" on the end and one with a "9" on the end.
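The three rules above can be sketched in Python roughly as follows. This is an illustration rather than my REXX code: `generate_masks` is a hypothetical name, "letter" is assumed to mean A to Z, and job names shorter than eight characters are simply skipped (something the rules above don't address).

```python
import string

def generate_masks(job_names, threshold=0.9, length=8):
    """Build candidate masks position by position, forking where neither
    letters nor digits exceed the threshold."""
    # Assumption: only full-length names take part in the counting.
    names = [n for n in job_names if len(n) == length]
    if not names:
        return []
    masks = [""]  # start with a single empty partial mask
    for pos in range(length):
        alpha_share = sum(n[pos] in string.ascii_uppercase for n in names) / len(names)
        digit_share = sum(n[pos] in string.digits for n in names) / len(names)
        if alpha_share > threshold:
            masks = [m + "A" for m in masks]
        elif digit_share > threshold:
            masks = [m + "9" for m in masks]
        else:
            # No clear winner: keep both possibilities for this position.
            masks = [m + c for m in masks for c in "A9"]
    return masks

# Invented data: the fourth character is a digit in 9 names out of 10.
names = ["ABC123DE"] * 9 + ["ABCD23DE"]
print(generate_masks(names, threshold=0.9))  # 90% does not exceed 90%, so position 4 forks
print(generate_masks(names, threshold=0.8))  # 90% exceeds 80%, so a single mask results
```

Each undecided position doubles the number of masks, so two such positions give four.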
In this test I generated four masks: "AAA999AA", "AAAA99AA", "AAA9A9AA" and "AAAAA9AA". All the masks start with "AAA" and end with "9AA". The doubt is in the middle where "99", "A9", "9A" and "AA" got generated.
If I drop the threshold from 90% to 80% I only get "AAA999AA" so maybe that is a good naming convention after all. (In fact the fourth and fifth characters are 87% and 88% numeric, respectively. And the sixth character is numeric 91% of the time – so it scraped through.)
As I said, my initial testing of the mask-matching used "AAA999AA" because the customer had indicated that was their convention. So my code allows you to specify masks and then adds the automatically-generated ones to them.
I think the experiment worked well. I can see cases where the code needs enhancing. I can see cases where it mightn’t be perfect. But I do think this code worth running (and tweaking) at the beginning of every relevant engagement.
* I’m doing my programming in REXX – which doesn’t even have regular expressions. It might be nice to write a function package that did it. A challenge for someone? Anyone? 🙂