(Originally posted 2012-02-26.)
I hope you don’t get the idea I’m overly into rigour, talking about Classification. But I think it has to be done – to provide terminology for this series of posts.
This is the second of four posts on Batch Parallelism, following on from Motivation.
If I think about how parallelism works in batch it broadly falls into two camps:
- Heterogeneous
- Homogeneous
(If you look these two terms up in Wikipedia – possibly for the spelling 🙂 – you get to see, under a rather tasty 🙂 graphic, the words "Clam chowder, a heterogeneous material".) 🙂
Let me explain what I mean by these two, in terms of batch classification.
Heterogeneous
Almost all customers run more than one batch job at a time. Personally, I’ve never seen anyone feeding through a single job at a time.
But a lot of the time it’s separate suites (or applications, if you prefer). Or certainly it’s running dissimilar jobs alongside each other.
You can further divide this case – in a way which actually makes it less abstract:1
- Not Linked
This would be the case with totally separate suites, possibly from different lines of business.
- Weakly Linked
Again, these are separate suites, but this time the suites feed into each other – at least occasionally. These are less likely to be from separate lines of business – though a thoroughly integrated enterprise might have more cross-suite linkages.
- Strongly Linked
This would typically be the case of a single suite – where the whole point is to do related things, such that data flows between the jobs (and even steps).
By "linked" I’m mainly talking about data flows, though it could be operational cohesiveness.
This is the case where work is very strongly related. There are two subcases:
- Cloning
It’s quite common for applications to be (re-)engineered so that identical jobs run against subsets of the data. This is commonly termed "cloning".
- Within-Step Parallelism
An example of this is DB2 CP Query Parallelism – where DB2 splits the task up into, effectively, clones – but manages them as a single unit of work.
Not quite the same, but possibly best fitting here, is substep parallelism.
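The cloning subcase can be sketched in a few lines of Python. This is purely illustrative and not from the original post: `run_job`, `partition`, and the record layout are all invented names, and threads stand in for what would really be separate batch jobs, each fed a disjoint subset (typically a key range) of the data.

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(records):
    """A stand-in batch 'job': total an amount field over its input records."""
    return sum(r["amount"] for r in records)

def partition(records, n):
    """Split the input into n roughly equal subsets - one per clone."""
    return [records[i::n] for i in range(n)]

records = [{"amount": i} for i in range(1000)]

# Serial: a single job processes all the data.
serial_total = run_job(records)

# Cloned: identical copies of the same job logic each process a subset,
# running in parallel. (Threads here stand in for separate jobs.)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(run_job, partition(records, 4)))
cloned_total = sum(partials)

# The clones together produce the same answer as the single serial job.
assert cloned_total == serial_total
```

The point of the sketch is the shape, not the code: the job logic is unchanged, only the data is divided – which is why cloning usually needs a partitioning key and a final consolidation step.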
Which Do YOU Do?
I think most customers do "heterogeneous" to a very considerable degree. That’s because it comes naturally and is the way the business has grown and driven things.
Less common (and I was recently pressed to give a view on how common) is "homogeneous". That’s because it takes real effort.
The answer I gave was something along the lines of "I don’t know for certain but I guess about 30% of customers do homogeneous".2 The reason I gave that answer is that I suspect homogeneous parallelism gets added to applications to make them perform.
It’s my view that applications are going to have to become more homogeneously parallel in the future – because of the dynamic I described in Part 1: over time the speed-up required of individual actors (typically batch jobs) is likely to outstrip that delivered by technology.
To become more homogeneously parallel we’re going to have to understand the batch applications much better. (Actually that’s also true of efforts towards more heterogeneous parallelism.) Parts 3 and 4 of this series will address some of this understanding – and provide some guidance on what’s going to need to be understood. Hopefully they’ll also make this classification seem less dry and more helpful. 🙂
1 There’s probably a rule that says the leaf nodes of classification schemes yield a higher proportion of concrete examples.
2 The "I don’t know for certain" part of it is because I recognise I see a "self-selecting group" or "biased sample" of customer situations: Those that are particularly thorny or exceptionally critical.