Pretty Printing XML in Python

(Originally posted 2011-06-22.)

A need has arisen for pretty printing XML. This post includes some Python code to do it.

I’ve been working with the OpenOffice.org ODP presentation file format recently: I want to generate presentations from application code.

An ODP file is a structured zip file. Among other items it contains an XML file – content.xml – which describes the pages in the presentation. content.xml concatenates everything as one long line of text. It’s therefore very hard to read – without specialist XML parsing tools. And I’d like to – so I can teach my code to navigate and update it.

So the following is a short Python program to pretty print an XML file. It takes the XML file as standard input (stdin) and writes the formatted XML to standard output (stdout). If it can’t parse the XML it writes an appropriate error message to standard error (stderr).

While this isn’t a Python tutorial I’d like to outline how it works:

  • I’m using the built-in xml.dom.minidom package to parse the XML and to pretty print it.
  • I’m using the built in re regular expression package to remove trailing spaces on a line that minidom’s toprettyxml() seems to create.
  • I’m using standard Python replace() to do some other substitutions to further clean up the XML.
  • If the read-in XML isn’t valid minidom throws a ExpatError. In fact it uses xml.parsers.expat and this is what throws the error.

I’ll admit to a little disappointment that toprettyxml() doesn’t create quite as pretty an output file as I’d like. Hence the post-processing. Feel free to experiment with different values for the parameters to it. Also you might not feel the need for some of the post-processing.

Here’s the code. It’s more white space, comments and error handling than actual straight-through processing.


#!/usr/bin/python
from xml.dom import minidom
from xml.parsers.expat import ExpatError
import sys,re

# Edit the following to control pretty printing
indent="  "
newl=""
encoding="UTF-8"

# Regular expression to find trailing spaces before a newline
trails=re.compile(' *\n')

try:
  # Parse the XML - from stdin
  dom=minidom.parse(sys.stdin)
  
  # First-pass Pretty Print of the XML
  prettyXML=dom.toprettyxml(indent,newl,encoding)
  
  # Further clean ups
  prettyXML=prettyXML.replace("\t","")
  prettyXML=prettyXML.replace('"?><','"?>\n<')
  prettyXML=re.sub(trails,"\n",prettyXML)
  
  # Write XML to stdout
  sys.stdout.write(prettyXML)

except ExpatError as (expatError):
  sys.stderr.write("Bad XML: line "+str(expatError.lineno)+" offset "+str(expatError.offset)+"\n")

One question: Can this be run on z/OS? I believe it can: minidom is a core part of Jython and Jython certainly runs under z/OS. In fact, being a java jar I’d expect it to run on a zAAP (or zIIP with zAAP-on-zIIP). If you try it and get it working let us know.

Published by Martin Packer

.

One thought on “Pretty Printing XML in Python

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: