(Originally posted 2011-05-11.)
In the distant past I’ve written about using DFSORT to parse XML. This post (and two follow-on posts) will describe an experiment to make such processing much more robust.
In this post I’ll describe the problem I’m trying to solve, explain why it matters, and briefly outline my solution.
This isn’t meant to be the most detailed description of XML, nor a complete list of where it’s used. I just want you to know (if you didn’t already) why I think XML processing is something to pay attention to.
Increasingly, applications are producing and consuming XML. (They’re also producing and consuming other newer data formats, such as JSON.) I divide this usage into two categories:
- Configuration data (generally small files).
- Business data (often very large files).
XML has many advantages as a data format, including robustness, standardisation and an increasing degree of inter-enterprise adoption. It also has useful attributes, such as the ability to validate a file against a strict grammar, and transformability.
XML is, however, expensive to parse. And when I talk of transformability the tools to transform XML are still quite rudimentary – you often have to write your own program to do it.
(This being an IBM-hosted blog you might expect me to talk about WebSphere Transformation Extender (WTX). I shan’t, except to say it has very nice tooling. Similarly, you might expect me to talk about Extensible Stylesheet Language Transformations (XSLT) – as a standard for transformations. You’re in luck with XSLT – but that will have to wait. I’d also like to talk about IBM’s XML Toolkit for z/OS (which includes an XSLT processor) but that, too, will have to wait. And as for DataPower, it’ll be a while before I talk about it as well.)
Those of you familiar with IBM mainframe technology will be aware of z/OS XML System Services and perhaps the XML Toolkit for z/OS. You’re probably aware of the ability to offload XML parsing to a zAAP (or a zIIP if zAAP-on-zIIP is in play). I think our story’s pretty good with these.
So IBM thinks XML’s important, and so do lots of installations. It’s important that mainframe people know what they can do, too.
The Problem I’m Trying To Solve
I don’t feel it necessary to describe what DFSORT can do in this post. Suffice it to say it can do lots of what I call "slice and dice" with data. So long as that data is record-oriented. (And it’s even better if you include ICETOOL.)
So why don’t we just process XML with DFSORT?
(Let’s disregard publishing XML with DFSORT as that’s very easy to do.)
Traditionally DFSORT has done really well when records are neatly divided into fixed-position (and fixed-length) fields. Over recent years it’s got better and better at handling cases where the layout of each record is variable. For example, it can parse Comma-Separated Values (CSV) files just fine – with PARSE.
But XML is so much more variable. For example, two partners could each send you a file, created by their own programming or tools. They’d be semantically equivalent but the data would be differently formatted (and still be valid according to the same XML Schema). And the differences wouldn’t just be the fields being at different offsets, or in a different order in the same record: One format might have an element all on one line whereas the other might spread it across three lines.
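To make that variability concrete, here is a minimal Python sketch. It is only an illustration: Python's standard library parser stands in for whatever XML parser you would really use, and the two `orders` documents (and all their element names) are invented. The two hypothetical partners send the same data laid out differently, so a line-oriented view sees different records while an XML parser sees identical content.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents from two partners: same data, different layout.
PARTNER_A = "<orders><order id='1'><customer>Smith</customer><total>12.50</total></order></orders>"

PARTNER_B = """<orders>
  <order id="1">
    <total>12.50</total>
    <customer>
      Smith
    </customer>
  </order>
</orders>"""


def line_view(text):
    """What a record-oriented tool sees: one record per line."""
    return text.splitlines()


def parsed_view(text):
    """What an XML parser sees: (id, customer, total) per order, layout ignored."""
    root = ET.fromstring(text)
    return [
        (
            order.get("id"),
            order.findtext("customer").strip(),
            order.findtext("total").strip(),
        )
        for order in root.iter("order")
    ]


if __name__ == "__main__":
    # The line counts differ even though the content does not.
    print(len(line_view(PARTNER_A)), "line(s) vs", len(line_view(PARTNER_B)), "line(s)")
    print(parsed_view(PARTNER_A) == parsed_view(PARTNER_B))
```

Note that the second document also swaps the order of the two child elements and splits one of them across three lines, which are exactly the kinds of differences a fixed record layout cannot absorb.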
So any DFSORT application attempting to process XML would be vulnerable to this variability. In the past, when I’ve written of DFSORT processing XML I think I’ve said that you need stable XML to work with. I think that’s still right.
So is that it? Well, no it isn’t: I still think it’s possible to take advantage of DFSORT’s power, even with XML data to process. Read on…
XSLT (standing for Extensible Stylesheet Language Transformations) is a standards-based way of transforming XML – to (different) XML, HTML or even plain text. And by "(different) XML" I also mean things like SVG vector graphics.
With XSLT you define a transformation using another piece of XML – a stylesheet (or XSL file). Whether you author this by hand (my current state) or use tooling to generate one is up to you. You then run an XSLT processor, which applies the XSL file to your XML and transforms it into whatever you want.
There are lots of XSLT processors. I’ve used Apache Xalan (which is tightly coupled to the IBM ones on z/OS), Saxon, the capabilities built into Firefox (and other browsers) and PHP’s processor – to name just a few. Of these only Saxon can do XSLT 2.0 at present. (The others all do XSLT 1.0, often with extension capabilities.)
For my work, written up in these posts, I used the free variant of Saxon – because it does 2.0. Nothing in these posts, however, requires 2.0. I want 2.0 just so I can learn 2.0. One day maybe it’ll catch on and then I’ll be in good shape. Learning 2.0 isn’t incompatible with learning 1.0 but it might leave you frustrated. 🙂
The important piece in all this is that XSLT can be used to take arbitrary XML and flatten it – into records with fields in vaguely sensible places. In EBCDIC.
Putting It Together
So far I’ve talked about two distinct components: DFSORT / ICETOOL and XSLT. I’ve said it’d be nice to be able to process XML-originated data using DFSORT, robustly. So here’s how it can be done:
- Use XSLT to create a flat file (in HFS or zFS) with the data flattened into sensible records with well-delimited fields. (In the example, in the next post in this series, I’ll use CSV as the intermediate file layout.)
- Use DFSORT’s parsing capabilities to read the intermediate file and then do DFSORT’s normal things with it. (This will be the third post in the series.)
Conceptually simple but a little fiddly in the details. In the next two posts I’ll clothe the idea with some of those details.
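As a rough illustration of the data flow in the two steps above, here is a small Python sketch. It is not the method used in the later posts: `xml.etree.ElementTree` stands in for the XSLT step, the `csv` module and `sorted` stand in for DFSORT's parsing and sorting, and the sample document, element names and field order are all invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input; element and attribute names are invented for illustration.
SAMPLE_XML = """<orders>
  <order id="2"><customer>Jones, A</customer>
    <total>7.25</total></order>
  <order id="1">
    <customer>Smith</customer><total>12.50</total>
  </order>
</orders>"""


def flatten_to_csv(xml_text):
    """Step 1 (stand-in for XSLT): write one CSV record per <order> element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for order in root.iter("order"):
        writer.writerow([
            order.get("id"),
            order.findtext("customer", default="").strip(),
            order.findtext("total", default="").strip(),
        ])
    return out.getvalue()


def sort_records(csv_text, field=0):
    """Step 2 (stand-in for DFSORT): parse delimited fields, then sort on one."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Sorts on the character value of the chosen field.
    return sorted(rows, key=lambda row: row[field])


if __name__ == "__main__":
    intermediate = flatten_to_csv(SAMPLE_XML)
    print(intermediate, end="")
    for row in sort_records(intermediate):
        print(row)
```

The comma inside the invented customer name "Jones, A" is deliberate: it gets quoted in the intermediate file, which is a reminder that "well-delimited" has to allow for delimiters turning up inside the data itself.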
Over the past few days, while preparing to write this post, I’ve done some experimenting – including creating a full working example. There are lots of "wrinkles" on this idea, including other ways of doing pieces of it. Perhaps you’ve thought of a few. If so, do let us know.