(Originally posted 2009-11-27.)
A new set of functions was recently made available for DFSORT via PTFs UK51706 and UK51707.
In this post I want to talk about the new JOINKEYS function, and try to add a little value by discussing some performance considerations. I’ve had the code for a couple of months and have played with it but not extensively. So much of what follows is based on thinking about the function (described in this document) and bringing some of my DB2 experience to bear.
With this enhancement DFSORT allows you to do all the kinds of two-way joins DB2 folks would expect to be able to do – in a single, simple operation. "Two-way" refers to joining two files together. You can perform, for example, a three-way join by joining two files together and then joining the resulting file with a third. With "raw" DFSORT that would be two job steps. With ICETOOL you can make it a single job step. In any case I think I’d recommend using ICETOOL, because converting to ICETOOL later – when you find you want to add a third file to the join – would be additional work.
How JOINKEYS Works
Before talking about performance let me describe how JOINKEYS works. In JOINKEYS parlance we talk about files "F1" and "F2". Indeed the syntax uses those terms…
- The join itself is performed by the main DFSORT program task. It receives its data through a special E15 exit and processes it like any other DFSORT invocation, except that it knows it’s doing a join. So things like E35 exits and OUTFIL all work as normal.
- Both F1 and F2 files are read by separate tasks. Each writes its data using an E35 exit. Normal processing capabilities apply, such as E15 exits (potentially different for F1 and F2), INCLUDE / OMIT, and INREC processing.
- The F1 and F2 tasks and the main task communicate through "pipes" constructed between the F1 and F2 E35 exits and the main task’s E15 exit. These pipes have no depth and don’t occupy significant working memory or any intermediate disk space.
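To make that concrete, here’s a minimal sketch of a JOINKEYS job. (The data set names, record lengths and key positions are invented for illustration – check the documentation for the full syntax.) The F1 and F2 tasks read SORTJNF1 and SORTJNF2 by default, and the main task’s control statements describe the join keys, the joined record layout, and what to do with the result:

```
//JOIN     EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTJNF1 DD DSN=MY.FILE1,DISP=SHR
//SORTJNF2 DD DSN=MY.FILE2,DISP=SHR
//SORTOUT  DD DSN=MY.JOINED,DISP=(NEW,CATLG),SPACE=(CYL,(10,10))
//SYSIN    DD *
* Join on an 8-byte key starting in column 1 of each file
  JOINKEYS FILES=F1,FIELDS=(1,8,A)
  JOINKEYS FILES=F2,FIELDS=(1,8,A)
* Joined record: all 80 bytes of F1 followed by bytes 9-20 of F2
  REFORMAT FIELDS=(F1:1,80,F2:9,12)
* The main task just copies the joined records to SORTOUT
  OPTION COPY
/*
```

By default you get an inner join – only paired records – and the JOIN statement’s UNPAIRED operand gives you the outer-join variants.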
A Potential For Parallelism?
So we have three DFSORT tasks operating in parallel, feeding data through pipes. In principle they could run on separate processors. The extent to which that’s useful would, I think, depend on whether these tasks are performing sorts or just reformatting copies. I say this because in the copy case I’d expect the F1 and F2 tasks to be interlocked with the main task whereas in the sort case there’s stuff to do before we get to writing through the pipes. And in the latter case we’re probably only effectively driving two separate processors. But this is a fine point.
In any case we derive I/O parallelism because the F1 and F2 tasks run in parallel. Again, its usefulness depends on timing.
Managing The Sorts
You can specify whether the F1 and F2 tasks each perform a sort. So you could declare that F1 was already sorted, whereas F2 wasn’t.
You can decide whether DFSORT will terminate if the F1 or F2 files are not in order. (This only applies, and only makes sense, if you’ve claimed the data was already sorted.)
You can specify whether the main task sorts the results of the joined F1 and F2 files.
More on why sort avoidance might be important in a minute.
As I mentioned earlier, you can use repeated invocations of JOINKEYS (most readily using ICETOOL) to join more than two files together.
Now this is where some DB2 SQL tuning background comes in handy…
You have a choice of which order to join the files in. As this isn’t DB2, you don’t have the Optimizer making such decisions for you, so you have to decide for yourself. But think about it: if you joined a large file to a small file in Step 1 and then joined the large resulting intermediate file to another small file in Step 2, you’ve chucked a lot of data around – twice. If you could instead join the small files together in Step 1 and join the large file to that result in Step 2, there would be less data chucking around. It ought to run faster.
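Here’s how that might look with ICETOOL, hedged the same way (data set names, lengths and key positions are invented): two COPY operators, joining the small files first and the large file last. The JKFROM operand tells ICETOOL the operator’s input comes from the JOINKEYS files, and F1=/F2= on the JOINKEYS statements name explicit ddnames so the two joins don’t clash:

```
//TOOL     EXEC PGM=ICETOOL
//TOOLMSG  DD SYSOUT=*
//DFSMSG   DD SYSOUT=*
//SMALL1   DD DSN=MY.SMALL1,DISP=SHR
//SMALL2   DD DSN=MY.SMALL2,DISP=SHR
//BIG      DD DSN=MY.BIG,DISP=SHR
//TEMP     DD DSN=&&TEMP,DISP=(NEW,PASS),UNIT=SYSDA,SPACE=(CYL,(5,5))
//OUT      DD DSN=MY.RESULT,DISP=(NEW,CATLG),SPACE=(CYL,(10,10))
//TOOLIN   DD *
* Step 1: join the two small files
  COPY JKFROM TO(TEMP) USING(CTL1)
* Step 2: join the large file to the small intermediate result
  COPY JKFROM TO(OUT) USING(CTL2)
/*
//CTL1CNTL DD *
  JOINKEYS F1=SMALL1,FIELDS=(1,8,A)
  JOINKEYS F2=SMALL2,FIELDS=(1,8,A)
  REFORMAT FIELDS=(F1:1,80,F2:9,12)
/*
//CTL2CNTL DD *
  JOINKEYS F1=BIG,FIELDS=(1,8,A)
  JOINKEYS F2=TEMP,FIELDS=(1,8,A)
  REFORMAT FIELDS=(F1:1,100,F2:1,92)
/*
```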
Cutting Down The Data
As with all DFSORT invocations, cutting down the data early is important. Joining two large files together only to throw away large amounts of the result is inefficient. If you can throw away unwanted records on the way in, or throw away unwanted fields, the join will be more efficient. In the F1 and F2 tasks you can do both.
In the F1 and F2 tasks you can supply file size estimates – as they each have their own control files, by default “JNF1CNTL” and “JNF2CNTL”. You could do this for the main sort, too. In the F1 and F2 case this is more important when you cut down the files on the way in, since the input file’s size is then a poor guide.
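For instance, the F1 task’s control file might filter and trim the records before they reach the pipe, and state an estimate of how many survive. (A sketch: the condition, field positions and count are invented, and I’m assuming FILSZ=En is how you’d state the estimate here.)

```
//JNF1CNTL DD *
* Keep only the record type we care about
  INCLUDE COND=(9,1,CH,EQ,C'A')
* Keep only the key and the one field the join needs
  INREC FIELDS=(1,8,21,12)
* Rough estimate of the record count surviving the INCLUDE
  OPTION FILSZ=E500000
/*
```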
Avoiding Unnecessary Sorts
If you know the files you are joining are already sorted in an appropriate order for the join you can avoid sorts on the way into the join. And this will obviously be more efficient. If you can live with the order DFSORT writes the records from JOINKEYS you can use COPY rather than SORT in the main task.
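Put together, a join where both inputs are known to be in key order and the output order doesn’t matter could avoid all three sorts – a sketch with invented positions:

```
* SORTED declares F1 is already in key order (sequence-checked)
  JOINKEYS FILES=F1,FIELDS=(1,8,A),SORTED
* NOSEQCK also suppresses the sequence check on F2
  JOINKEYS FILES=F2,FIELDS=(1,8,A),SORTED,NOSEQCK
  REFORMAT FIELDS=(F1:1,80,F2:9,12)
* COPY rather than SORT in the main task
  OPTION COPY
```

With SORTED alone, DFSORT still sequence-checks and terminates on an out-of-order record – the behaviour described above – and NOSEQCK turns that check off.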
In the worst case – where F1 and F2 files are sorted in parallel and where the main task also sorts data – you have the potential for large amounts of memory being necessary. You need to cater for that.
I really like this function. It removes the need for much fiddliness – and it does it in a simple way. (I’m conscious I’ve shown no examples but the documentation linked to above is replete with them.)
My perspective is as a performance guy who has some knowledge of how DB2 does joins. This isn’t the same code, so the lessons from the DB2 Optimizer have to be applied sparingly. And note we don’t even have indexes on sequential files (though you could simulate an “index scan” by retrieving only the join keys).
I’d like to do some performance runs that illustrate the points above. I’m a little tight on time right now – so that’ll have to wait. And I’m sure there’s more thinking that could be done on how to tune JOINKEYS invocations.