(Originally posted 2012-07-02.)
DFSORT’s Arithmetic operators can do many things but the one thing they can’t do is take the square root of a number.
You might think that’s minor but it means you can’t calculate a Standard Deviation. (Variance is fine but dimensionally not so nice when used in conjunction with the Mean.) And I needed Standard Deviation in a real live customer situation.
So I set out to "roll my own". And here’s how I did it…
There is a well known technique for calculating square roots – the Babylonian Method (or Heron’s Method). It’s not terribly advanced maths, frankly: I think I was taught it at about age 14. It goes like this:
- Call your number whose square root is to be found a, and pick a first guess for the root. We’ll call this quess x0.
- Write x1 = (a + a/x0)/2. In other words xn+1 is the average of a and a/xn.
- Iterate until done. We’ll talk about what "done" means later.
As you can see it’s a really very simple method and converges reasonably quickly. And it’s not difficult to program, except for one thing: DFSORT doesn’t have a looping capability. (Yes it loops over the input records, but not in the way I mean.)
Before I show you some code let me talk about some considerations:
- Because DFSORT’s arithmetic is integer I describe here taking the integer square root of an integer field value. So, for example, it yields sqrt(5)=2.
- All the arithmetic on the way to finding the root is integer also.
- The code takes square roots of non-negative integers: No imaginary roots here.
- The method returns positive roots: As you probably know roots come in pairs, differing only by a sign.
- The algorithm requires you to provide a first estimate (x0). My first take was a/2. But in the case of a=1 this leads to a root of 0 straightaway – which is particularly unhelpful. So I had to special-case a=1. So the take I went with in the end is x0=a. Not generally a brilliant first guess but avoids the special case.
- Because this is integer arithmetic completion can’t be only when sqrt(a)*sqrt(a)=a as in most cases this condition isn’t going to be met. Instead I use xn+1=xn as completion: If the two are equal the algorithm isn’t going to produce a different xn+2.
A reasonable question to ask is whether integer square roots are useful…
A standard technique with DFSORT’s Arithmetic operators is to boost the integer values to represent the number of decimal places required. So, for example, to add together 3.14 and 2.72 you’d be storing them as 314 and 272. Then you’d add these integers and use EDIT=(I.TT) to display them as "4.86". So if you wanted to take the square root of 3.1416 you’d multiply by 10,000 and then apply my technique, using EDIT=(IIII.TT) to move the decimal point back.
Structure Of The Example
This is a standard DFSORT invocation. Rather than give you the usual scaffolding I’ll show you the main components:
- DFSORT Symbols where possible to help keep it clear.
- COPY (though it could’ve been SORT or MERGE).
- INREC to parse the free-form input data.
- OUTREC with multiple IFTHEN clauses to compute the square root.
- OUTFIL to format the output lines.
You could, in fact, structure this program differently. It might (though I haven’t worked through it) combine everything into a single INREC statement (apart from the COPY specification).
Some Sample Data
In my SORTIN I have the following lines:
100 5 1 27 3 4 999 999999 99999999
In other words a single field in each record in EBCDIC (rather than binary) form. (You can generalise all that follows with a small amount of thought.)
(In my test case this is instream data – so it’s fixed-length records and hence no RDW).
Parsing EBCDIC Numbers.
In this simple case I created a set of records with binary representations of the numbers using INREC:
OPTION COPY INREC FIELDS=(1,10,UFF,TO=BI)
Here the UFF does the parsing and it parses the 10 bytes starting in position 1, with the output being 4-byte BI fields.
A SYMNAMES Deck To Simplify Things
You may know you can define symbols with DFSORT. Here’s the deck I used in this example:
//SYMNAMES DD * INPUT,1,4,BI SQRT,*,4,BI PREV,*,4,BI /*
(You’ll want a SYMNOUT DD to go with it.)
- INPUT maps the integer value created by INREC. It’s a, the number whose square root we’re seeking.
- SQRT maps the field where successive estimates of the root are stored.
- PREV maps the field where the previous estimate is stored – so we can tell if an iteration has changed the estimate of the root.
Taking The Square Root
In my example all the work is done in a series of IFTHEN clauses in an OUTREC statement:
- A single WHEN=INIT clause initialises the estimate (SQRT) and clears the previous estimate (PREV).
- Repeated WHEN=(SQRT,NE,PREV) clauses compute successive approximations – but only if the last two estimates weren’t the same.
Here’s the code:
OUTREC IFTHEN=(WHEN=INIT, OVERLAY=(SQRT,INPUT,X'00000000')), IFTHEN=(WHEN=(SQRT,NE,PREV),HIT=NEXT, OVERLAY=(PREV:SQRT,SQRTSQRT,ADD,(INPUT,DIV,SQRT)),DIV,+2, TO=BI,LENGTH=4)), IFTHEN=(WHEN=(SQRT,NE,PREV),HIT=NEXT, OVERLAY=(PREV:SQRT,SQRT
SQRT,ADD,(INPUT,DIV,SQRT)),DIV,+2, TO=BI,LENGTH=4)), ... IFTHEN=(WHEN=(SQRT,NE,PREV),HIT=NEXT, OVERLAY=(PREV:SQRT,SQRT
SQRT,ADD,(INPUT,DIV,SQRT)),DIV,+2, TO=BI,LENGTH=4))
This you’ll recognise as an unrolled loop. The "…" says you can repeat the previous IFTHEN clause as many times as you like.
A good question would be "how many times?" I found good convergence after 15. For example, the 99999999 value went to 10028 in 15 iterations and 10000 in 16. You could pick 20 and be sure of getting an accurate value.
Let’s talk about control flow through the queue of IFTHEN clauses…
- The WHEN=INIT is always executed.
- HIT=NEXT on a WHEN=(logexp) clause means "even if this test succeeds proceed to the next WHEN=(logexp) clause having executed this one’s OVERLAY".
- Even if a WHEN=(logexp)’s logical expression is false all subsequent WHEN=(logexp) clauses are processed – even if, as in this case, they’ll evaluate to false. So this isn’t really a "WHILE(logexp) DO … END" construct (but you can construct the logexp so it works that way).
In the past I’ve likened multiple IFTHEN clauses to a pipeline. This example shows why.
Printing The Results
The following is a quite complicated piece of report formatting. Its aim is to format numbers and then squeeze out the resulting leading blanks. At the same time it has to preserve blanks in text.
OUTFIL IFTHEN=(WHEN=INIT, BUILD=(C'x=',INPUT,EDIT=(IIIIIIIIIIT), C'_sqrt(x)=',SQRT,EDIT=(IIIIIIIIIIIT), C'_sqrt(x)*sqrt(x)=',SQRT,MUL,SQRT,EDIT=(IIIIIIIIIIIT), C'_prev_est=',PREV,EDIT=(IIIIIIIIIIIT))), IFTHEN=(WHEN=INIT, BUILD=(1,90,SQZ=(SHIFT=LEFT,PAIR=QUOTE))), IFTHEN=(WHEN=INIT, FINDREP=(IN=C'_',OUT=C' '))
Let’s examine it more closely:
The first WHEN=INIT takes four numbers and formats them:
- The number we wanted to find the square root of,
- The square root,
- What happens when you square that back up again,
- The previous estimate of the square root.
These are formatted with leading zeroes turned into blanks and some text is wrapped around them (with underscores standing in for blanks).
The second WHEN=INIT squeezes all the blanks out. This is why I used underscores in the wrapping text.
The third WHEN=INIT turns all the underscores into blanks.
Output Using The Sample Data
The output below isn’t particularly pretty. I wanted to use the OUTFIL code I’ve shown you to demonstrate the ability to selectively squeeze blanks out.
x=100 sqrt(x)=10 sqrt(x)*sqrt(x)=100 prev est=10 x=5 sqrt(x)=2 sqrt(x)*sqrt(x)=4 prev est=2 x=1 sqrt(x)=1 sqrt(x)*sqrt(x)=1 prev est=1 x=27 sqrt(x)=5 sqrt(x)*sqrt(x)=25 prev est=5 x=3 sqrt(x)=2 sqrt(x)*sqrt(x)=4 prev est=1 x=4 sqrt(x)=2 sqrt(x)*sqrt(x)=4 prev est=2 x=999 sqrt(x)=31 sqrt(x)*sqrt(x)=961 prev est=31 x=999999 sqrt(x)=999 sqrt(x)*sqrt(x)=998001 prev est=1000 x=99999999 sqrt(x)=10028 sqrt(x)*sqrt(x)=100560784 prev est=10784
Although this has been quite a complex post I hope it’s taught you a few new DFSORT tricks. Actually, the “meat” of it is the Square Root calculation – which I’m using “in anger” (AKA “for real”) right now. And the technique should be readily adaptable.
The question is: Are there other Newton-Raphson-style computations worth doing?