(Originally posted 2011-06-22.)
A need has arisen for pretty printing XML. This post includes some Python code to do it.
I’ve been working with the OpenOffice.org ODP presentation file format recently: I want to generate presentations from application code.
An ODP file is a structured zip file. Among other items it contains an XML file – content.xml – which describes the pages in the presentation. content.xml concatenates everything as one long line of text. It’s therefore very hard to read – without specialist XML parsing tools. And I’d like to – so I can teach my code to navigate and update it.
So the following is a short Python program to pretty print an XML file. It takes the XML file as standard input (stdin) and writes the formatted XML to standard output (stdout). If it can’t parse the XML it writes an appropriate error message to standard error (stderr).
While this isn’t a Python tutorial I’d like to outline how it works:
- I’m using the built-in xml.dom.minidom package to parse the XML and to pretty print it.
- I’m using the built in re regular expression package to remove trailing spaces on a line that minidom’s toprettyxml() seems to create.
- I’m using standard Python replace() to do some other substitutions to further clean up the XML.
- If the read-in XML isn’t valid minidom throws a ExpatError. In fact it uses xml.parsers.expat and this is what throws the error.
I’ll admit to a little disappointment that toprettyxml() doesn’t create quite as pretty an output file as I’d like. Hence the post-processing. Feel free to experiment with different values for the parameters to it. Also you might not feel the need for some of the post-processing.
Here’s the code. It’s more white space, comments and error handling than actual straight-through processing.
#!/usr/bin/python
from xml.dom import minidom
from xml.parsers.expat import ExpatError
import sys,re
# Edit the following to control pretty printing
indent=" "
newl=""
encoding="UTF-8"
# Regular expression to find trailing spaces before a newline
trails=re.compile(' *\n')
try:
# Parse the XML - from stdin
dom=minidom.parse(sys.stdin)
# First-pass Pretty Print of the XML
prettyXML=dom.toprettyxml(indent,newl,encoding)
# Further clean ups
prettyXML=prettyXML.replace("\t","")
prettyXML=prettyXML.replace('"?><','"?>\n<')
prettyXML=re.sub(trails,"\n",prettyXML)
# Write XML to stdout
sys.stdout.write(prettyXML)
except ExpatError as (expatError):
sys.stderr.write("Bad XML: line "+str(expatError.lineno)+" offset "+str(expatError.offset)+"\n")
One question: Can this be run on z/OS? I believe it can: minidom is a core part of Jython and Jython certainly runs under z/OS. In fact, being a java jar I’d expect it to run on a zAAP (or zIIP with zAAP-on-zIIP). If you try it and get it working let us know.
One thought on “Pretty Printing XML in Python”