Java Language

XML Parsing using the JAXP APIs

Remarks#

XML Parsing is the interpretation of XML documents in order to manipulate their content using sensible constructs, be they “nodes”, “attributes”, “documents”, “namespaces”, or events related to these constructs.

Java has a native API for XML document handling, called JAXP, or Java API for XML Processing. JAXP and a reference implementation has been bundled with every Java release since Java 1.4 (JAXP v1.1) and has evolved since. Java 8 shipped with JAXP version 1.6.

The API provides different ways of interacting with XML documents, which are :

  • The DOM interface (Document Object Model)
  • The SAX interface (Simple API for XML)
  • The StAX interface (Streaming API for XML)

Principles of the DOM interface

The DOM interface aims to provide a W3C DOM compliant way of interpreting XML. Various versions of JAXP have supported various DOM Levels of specification (up to level 3).

Under the Document Object Model interface, an XML document is represented as a tree, starting with the “Document Element”. The base type of the API is the Node type, it allows to navigate from a Node to its parent, its children, or its siblings (although, not all Nodes can have children, for example, Text nodes are final in the tree, and never have childre). XML tags are represented as Elements, which notably extend the Node with attribute-related methods.

The DOM interface is very usefull since it allows a “one line” parsing of XML documents as trees, and allows easy modification of the constructed tree (node addition, suppression, copying, …), and finally its serialization (back to disk) post modifications. This comes at a price, though : the tree resides in memory, therefore, DOM trees are not always practical for huge XML documents. Furthermore, the construction of the tree is not always the fastest way of dealing with XML content, especially if one is not interested in all parts of the XML document.

Principles of the SAX interface

The SAX API is an event-oriented API to deal with XML documents. Under this model, the components of an XML documents are interpreted as events (e.g. “a tag has been opened”, “a tag has been closed”, “a text node has been encountered”, “a comment has been encountered”)…

The SAX API uses a “push parsing” approach, where a SAX Parser is responsible for interpreting the XML document, and invokes methods on a delegate (a ContentHandler) to deal with whatever event is found in the XML document. Usually, one never writes a parser, but one provides a handler to gather all needed informations from the XML document.

The SAX interface overcomes the DOM interface’s limitations by keeping only the minimum necessary data at the parser level (e.g. namespaces contexts, validation state), therefore, only informations that are kept by the ContentHandler - for which you, the developer, is responsible - are held into memory. The tradeoff is that there is no way of “going back in time/the XML document” with such an approach : while DOM allows a Node to go back to its parent, there is no such possibility in SAX.

Principles of the StAX interface

The StAX API takes a similar approach to processing XML as the SAX API (that is, event driven), the only very significative difference being that StAX is a pull parser (where SAX was a push parser). In SAX, the Parser is in control, and uses callbacks on the ContentHandler. In Stax, you call the parser, and control when/if you want to obtain the next XML “event”.

The API starts with XMLStreamReader (or XMLEventReader), which are the gateways through which the developer can ask nextEvent(), in an iterator-style way.

Parsing and navigating a document using the DOM API

Considering the following document :

<?xml version='1.0' encoding='UTF-8' ?>
<library>
   <book id='1'>Effective Java</book>
   <book id='2'>Java Concurrency In Practice</book>
</library>

One can use the following code to build a DOM tree out of a String :

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;

public class DOMDemo {

public static void main(String[] args) throws Exception {
    String xmlDocument = "<?xml version='1.0' encoding='UTF-8' ?>"
            + "<library>"
            + "<book id='1'>Effective Java</book>"
            + "<book id='2'>Java Concurrency In Practice</book>"
            + "</library>";

    DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
    // This is useless here, because the XML does not have namespaces, but this option is usefull to know in cas
    documentBuilderFactory.setNamespaceAware(true);
    DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
    // There are various options here, to read from an InputStream, from a file, ...
    Document document = documentBuilder.parse(new InputSource(new StringReader(xmlDocument)));

    // Root of the document
    System.out.println("Root of the XML Document: " + document.getDocumentElement().getLocalName());

    // Iterate the contents
    NodeList firstLevelChildren = document.getDocumentElement().getChildNodes();
    for (int i = 0; i < firstLevelChildren.getLength(); i++) {
        Node item = firstLevelChildren.item(i);
        System.out.println("First level child found, XML tag name is: " + item.getLocalName());
        System.out.println("\tid attribute of this tag is : " + item.getAttributes().getNamedItem("id").getTextContent());
    }

    // Another way would have been
    NodeList allBooks = document.getDocumentElement().getElementsByTagName("book");
}
}

The code yields the following :

Root of the XML Document: library
First level child found, XML tag name is: book
id attribute of this tag is : 1
First level child found, XML tag name is: book
id attribute of this tag is : 2

Parsing a document using the StAX API

Considering the following document :

<?xml version='1.0' encoding='UTF-8' ?>
<library>
   <book id='1'>Effective Java</book>
   <book id='2'>Java Concurrency In Practice</book>
   <notABook id='3'>This is not a book element</notABook>
</library>

One can use the following code to parse it and build a map of book titles by book id.

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class StaxDemo {

public static void main(String[] args) throws Exception {
    String xmlDocument = "<?xml version='1.0' encoding='UTF-8' ?>"
            + "<library>"
                + "<book id='1'>Effective Java</book>"
                + "<book id='2'>Java Concurrency In Practice</book>"
                + "<notABook id='3'>This is not a book element </notABook>"
            + "</library>";

    XMLInputFactory xmlInputFactory = XMLInputFactory.newFactory();
    // Various flavors are possible, e.g. from an InputStream, a Source, ...
    XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(new StringReader(xmlDocument));

    Map<Integer, String> bookTitlesById = new HashMap<>();

    // We go through each event using a loop
    while (xmlStreamReader.hasNext()) {
        switch (xmlStreamReader.getEventType()) {
            case XMLStreamConstants.START_ELEMENT:
                System.out.println("Found start of element: " + xmlStreamReader.getLocalName());
                // Check if we are at the start of a <book> element
                if ("book".equals(xmlStreamReader.getLocalName())) {
                    int bookId = Integer.parseInt(xmlStreamReader.getAttributeValue("", "id"));
                    String bookTitle = xmlStreamReader.getElementText();
                    bookTitlesById.put(bookId, bookTitle);
                }
                break;
            // A bunch of other things are possible : comments, processing instructions, Whitespace...
            default:
                break;
        }
        xmlStreamReader.next();
    }

    System.out.println(bookTitlesById);
}

This outputs :

Found start of element: library
Found start of element: book
Found start of element: book
Found start of element: notABook
{1=Effective Java, 2=Java Concurrency In Practice}

In this sample, one must be carreful of a few things :

  1. THe use of xmlStreamReader.getAttributeValue works because we have checked first that the parser is in the START_ELEMENT state. In evey other states (except ATTRIBUTES), the parser is mandated to throw IllegalStateException, because attributes can only appear at the beginning of elements.

  2. same goes for xmlStreamReader.getTextContent(), it works because we are at a START_ELEMENT and we know in this document that the <book> element has no non-text child nodes.

For more complex documents parsing (deeper, nested elements, …), it is a good practice to “delegate” the parser to sub-methods or other objets, e.g. have a BookParser class or method, and have it deal with every element from the START_ELEMENT to the END_ELEMENT of the book XML tag.

One can also use a Stack object to keep around important datas up and down the tree.


This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0 This website is not affiliated with Stack Overflow