# Cloning Fan-In

(Originally posted 2013-03-14.)

Suppose you have a set of numbers S over which you define a function f. Further suppose you partition S into S1 and S2. I’d like to know what function g is such that g({f(S1),f(S2)}) = f(S).

As much to the point I’d like to understand which functions f even have a corresponding function g that meets the condition.

Whoa! Was that pretentious enough for you? 🙂

Let me start again…

When I’ve talked about cloning batch jobs one of the problems to solve is what I call “fanning back in again”.

If you split the data into, say, 2 equal subsets and run it through 2 cloned jobs in parallel you have to take any “report file” and recreate it from whatever you could coax these clones to create.

(Maybe you should reread that first paragraph now.) 🙂

Let’s work through a simple example…

The clones work against subsets of records in file S we’ll call S1 and S2.

• Originally the function f created a report from S – simply totalling the value in a field of each record.

Then f(S1) sums that field over some of the records and f(S2) sums it over the others. You can probably guess that g is just adding the two together.

• As another example suppose f calculated an average of that field.

In this case recreating the average is merely a matter of counting the elements in S1 and S2 and using them to compute the overall average from the averages for each subset. (In fact just summing the values in S1 and S2 and dividing by the overall count would do just as well but involves changing the function f. You probably would prefer not to do that in general.)

• Calculating a maximum, standard deviation, or mode are three examples where it’s almost as simple as calculating the mean.

One feature of all of these is the need to carry forward information into some “fan-in” job step. In some of them it’s extra information – such as the subtotals for the mean. In others it’s the original information – such as the subtotals in the first case.

What I’d like to do is think about how one figures out whether such a fan in is even possible. I’m sure this isn’t a particularly new one – and any “divide and conquer” algorithm since time immemorial has had to deal with this issue. (I’m sure Hadoop has to deal with this, but we’re dealing with COBOL and PL/I here.) 🙂

And in Paragraph 1 I actually simplified it: 🙂 The “composition function” g should be designed to cope with arbitrary subsets of S – as we’re going to have to deal with 2-up, 4-up, 8-up cloning. It would be a real pity if the function only worked on pairs so 4-up would require 3 applications, for example. It should be a single sufficiently general function to allow the application to be readily cloned to whatever degree of parallelism required.

The whole thing is, of course, simplified: No report ever just plonks a single number on a page. (Unless that number is 42.) 🙂 Ultimately, though, you can break the problem down into a bunch of these simpler subproblems.

But if we are going to clone processing steps this is the kind of question that we’re going to have to answer: “Can we clone a job and still get the right results?”

And to finish here’s a nice pretty picture. 🙂 I might even make it into a slide or two. 🙂