(Originally posted 2011-05-14.)
This is the second part of a (currently) three-part series on processing XML data with DFSORT, given a little help from standard XML processing tools. The first part – which you should read before reading on – is here.
To recap, getting XML data into DFSORT is a two-stage process:
- Flatten the XML data so that it consists of records with fields in sensible places.
- Process this flattened data with DFSORT / ICETOOL or something else, like REXX.
This post covers the first of these stages. You’ll see how you can transform the XML file below into a Comma-Separated Values (CSV) file.
Here’s the source XML, complete with a few quirks:
|XML File To Be Processed|
<?xml version="1.0"?>
<mydoc>
  <greeting level="h1">
    Hello World!
  </greeting>
  <stuff>
    <item a="1"> <!-- 1 -->
      <row>One</row>
    </item>
    <item
      a="12"> <!-- 2 -->
      <row>Two</row>
    </item>
    <item a="903"> <!-- 3 -->
      <row>
        Three
      </row>
    </item>
  </stuff>
</mydoc>
Here’s the resulting flat file:
|Resulting Flat File For Processing With DFSORT / ICETOOL|
    "One",1
    "Two",12
    "Three",903
I’m assuming you can read XML reasonably well. In this example we have three "item" elements as children of a "stuff" element. The "stuff" element is a child of the "mydoc" element. The "mydoc" element also contains a "greeting" element. Each "item" element has a single "row" child element and an "a" attribute.
To produce the output we need to find the "item" elements and pick up the "row" child element and the "a" attribute value. We write one record for each "item" element. (We ignore the "greeting" element entirely.)
You may notice some white space around the output: a leading blank line and a trailing one, as well as four spaces at the beginning of each output record. I’ve not found a way of getting rid of those, so the DFSORT program (described in the next part of this series) will have to strip them off.
I’ve deliberately formatted each "item" element slightly differently:
1. The "a" attribute is on the same line as the "item" tag, and the "row" element fits entirely on one line.
2. The "a" attribute is on the next line, and the "row" element is on one line.
3. The "a" attribute is as in 1 but the "row" element text is split across three lines.
The point is that XML is so flexible in its layout that you’re better off relying on a supplied parser than writing your own. It’s true that there are good parsers that don’t do XSLT transformations, and the z/OS XML System Services one is very nice, particularly with its ability to use specialty engines. As I said in my previous post, XML parsing is computationally expensive.
Why not write your own code that calls the z/OS XML System Services parser? That’s certainly an option – and indeed you might find the transformations you want to do can’t (or shouldn’t) be done with XSLT. Here the similarity to DFSORT is quite strong: both provide built-in functions to transform data, and neither requires a formal programming language (in XML’s case perhaps PHP, Java or C++, and in DFSORT’s case perhaps Assembler, COBOL or PL/I).
In this example you scarcely need to write your own program. (Handling item 3, as I’ll describe later, is the one case where a program might be better.)
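For comparison, here is a minimal sketch of what the program route could look like. It is not the approach taken in this post: it uses the DOM parser that ships with the JDK as a stand-in for the z/OS parser mentioned above, the class and method names are made up for illustration, and the sample document is reproduced with an assumed two-space indentation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FlattenWithDom {
    // The sample document; the two-space indentation is an assumption.
    static final String XML = """
        <?xml version="1.0"?>
        <mydoc>
          <greeting level="h1">
            Hello World!
          </greeting>
          <stuff>
            <item a="1">
              <row>One</row>
            </item>
            <item
              a="12">
              <row>Two</row>
            </item>
            <item a="903">
              <row>
                Three
              </row>
            </item>
          </stuff>
        </mydoc>
        """;

    // Returns one CSV record per "item" element, in the form "row text",a
    static List<String> flatten(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> records = new ArrayList<>();
        // Finds every "item" in the document, not only those under mydoc/stuff.
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            Element row = (Element) item.getElementsByTagName("row").item(0);
            // trim() removes only leading and trailing white space,
            // leaving any white space inside the text untouched.
            String text = row.getTextContent().trim();
            records.add("\"" + text + "\"," + item.getAttribute("a"));
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        flatten(XML).forEach(System.out::println);
    }
}
```

The trade-off is visible even at this size: the program controls white space exactly, but the record layout is now buried in Java code rather than declared in a stylesheet.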
Here’s the XSLT stylesheet that produces the required output:
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <!-- 1 -->
  <xsl:output method="text" encoding="IBM-1047"/> <!-- 2 -->
  <xsl:template match="/">
    <xsl:apply-templates select="mydoc/stuff"/> <!-- 3 -->
  </xsl:template>
  <xsl:template match="item"> <!-- 4 -->
    <xsl:text>"</xsl:text> <!-- 5 -->
    <xsl:value-of select="normalize-space(row)"/> <!-- 6 -->
    <xsl:text>",</xsl:text> <!-- 7 -->
    <xsl:value-of select="@a"/> <!-- 8 -->
  </xsl:template>
</xsl:stylesheet>
This is a fairly simple stylesheet. Here’s how it works (the numbered lines above correspond to the numbering below):
1. Here we declare the level of the XSLT language to be 2.0. In fact there’s nothing about this stylesheet that requires that language level.
2. Here we say we’re creating a text file as output and that it will be EBCDIC (IBM-1047).
3. Here we select the "stuff" element within the "mydoc" element – using the XPath language. Because the enclosing template matches "/" (the document root), the path is evaluated from the top of the XML node tree, so the only "stuff" element selected is the one directly inside the top-level "mydoc" element. No template here matches "stuff" itself, so XSLT’s built-in rules process its children, which is where the "item" template below comes in.
4. This template matches the "item" elements within the "stuff" element.
5. Here text starts to be written out for the record. In this case it’s the leading quote around the first piece of data.
6. Here the first piece of data is written out – the text value of the "row" element. We’ll come back to the normalize-space() function in a minute.
7. Here a trailing quote and a comma are written out.
8. Here the value of the "a" attribute is written out. It needs no adjustment (in this example).
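The walkthrough above can be tried without Saxon, since the stylesheet needs nothing beyond XSLT 1.0. The sketch below is a stand-in under several assumptions rather than the setup used in this post: it runs on the XSLT processor built into the JDK, declares version="1.0", drops the encoding attribute because the result is kept in a Java string, and assumes the sample XML is indented two spaces per level. The stylesheet() helper and its two parameters are hypothetical, added only so that a second variant can be compared: one that selects the "item" elements directly and ends each record with an explicit newline.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FlattenWithXslt {
    // The sample document; the two-space indentation is an assumption.
    static final String XML = """
        <?xml version="1.0"?>
        <mydoc>
          <greeting level="h1">
            Hello World!
          </greeting>
          <stuff>
            <item a="1">
              <row>One</row>
            </item>
            <item
              a="12">
              <row>Two</row>
            </item>
            <item a="903">
              <row>
                Three
              </row>
            </item>
          </stuff>
        </mydoc>
        """;

    // Builds the stylesheet. "select" and "recordEnd" are hypothetical knobs
    // for this sketch; ("mydoc/stuff", "") gives the stylesheet as posted,
    // apart from version="1.0" and the omitted encoding attribute.
    static String stylesheet(String select, String recordEnd) {
        return """
            <?xml version="1.0"?>
            <xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
              <xsl:output method="text"/>
              <xsl:template match="/">
                <xsl:apply-templates select="%s"/>
              </xsl:template>
              <xsl:template match="item">
                <xsl:text>"</xsl:text>
                <xsl:value-of select="normalize-space(row)"/>
                <xsl:text>",</xsl:text>
                <xsl:value-of select="@a"/>%s
              </xsl:template>
            </xsl:stylesheet>
            """.formatted(select, recordEnd);
    }

    // Applies a stylesheet to a document and returns the text output.
    static String transform(String xslt, String xml) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String asPosted = transform(stylesheet("mydoc/stuff", ""), XML);
        String variant = transform(
                stylesheet("mydoc/stuff/item", "<xsl:text>&#10;</xsl:text>"), XML);
        // Spaces are shown as dots so the stray white space is visible.
        System.out.println(asPosted.replace(' ', '.'));
        System.out.print(variant);
    }
}
```

In the first call the white space text between the "item" elements is copied through by XSLT’s built-in rules, which accounts for the blank line and leading spaces described earlier. The second variant never visits those text nodes, so it may be one way to avoid them; it has not been tried with Saxon on z/OS.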
Because item 3’s "row" value was split across several lines, the normalize-space() function is used to take out the leading and trailing white space. It has the unfortunate side-effect of replacing each run of white space characters within the text with a single space, so it’s not brilliant. You could write a fairly simple but recursive piece of XSLT to do the job properly – but it’s beyond the scope of this post. In fact this might be the thing that makes you abandon XSLT and call the XML parser from a program.
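To see both effects of normalize-space() side by side, here is a small sketch using the XPath engine in the JDK (a stand-in, not Saxon). The "row" element and its text are made up for illustration, with two spaces between the words.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class NormalizeSpaceDemo {
    // Hypothetical "row" whose text spans three lines and contains a double space.
    static final String XML = "<row>\n    Thirty  Three\n  </row>";

    // Evaluates an XPath expression against the sample and returns its string value.
    static String eval(String xpath) throws XPathExpressionException {
        XPath xp = XPathFactory.newInstance().newXPath();
        return xp.evaluate(xpath, new InputSource(new StringReader(XML)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Brackets make the leading and trailing white space visible.
        System.out.println("[" + eval("string(/row)") + "]");
        System.out.println("[" + eval("normalize-space(/row)") + "]");
    }
}
```

The second result has lost the surrounding line breaks and indentation, which is the intent, but the double space between the two words has also become a single space.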
If you want to get into XSLT I can recommend Doug Tidwell’s book "XSLT, Second Edition: Mastering XML Transformations". It’s what I’ve used – with some additional research on the web (which didn’t yield much additional insight).
I used the (free) Saxon B XSLT processor as it’s the only one I can get my hands on that does XSLT 2.0. It’s a Java jar. You could use others, of course.
Invoking it from OMVS, I found a 64MB heap specification was enough (running in a 128MB region). For more complex transformations I can see a larger heap might be needed. (In fact I didn’t check how much garbage collection, if any, the JVM did. It just ran.)
(If you specify version="1.0" for the stylesheet Saxon will issue a message informing you that you’re running a 1.0 stylesheet through a 2.0 processor. This has caused no problems whatsoever for me.)
Originally I downloaded Saxon to my Linux laptop and used it with an ASCII stylesheet and XML data. Transferring to z/OS was straightforward. This approach may work for you, if you’re setting out to learn XSLT.
Learning and working with XSLT continues to be a journey of discovery. If I’m missing some tricks that you spot, feel free to let me know. The next post in this series will be about the DFSORT counterpart.