Does It Taste As Good As It XMLs?

Dante Lorenso
In the last few years, XML has received a great deal of media attention, and most languages support parsing and extracting data from XML documents. Besides being a great three-letter acronym to sprinkle on your resumé, XML is actually a useful data storage structure for PHP programmers.
Before you begin to use XML, you must first determine if your project really needs what XML offers. There are alternative data storage formats like fixed-width column files, tab-delimited files, CSV files, and database tables, but these formats typically can only manage a simple grid of rows and columns of data. XML provides several additional benefits for programmers including:
...but you already know that. What you want to do is use XML data inside your sparkling new web application. We'll explore one simple way to do this in the remainder of this article.

A Look At The XML File Structure

XML files can be validated against a DTD and store data in a format similar to what you'd see in an HTML document. Tag names are not fixed by the language; you make them up to suit your data (a DTD, if you use one, defines which ones are allowed), and nested tags form a tree structure. Here is an example of some XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- This is just a comment, ignore it -->
    <drive desc="Letters and Numbers Harddrive">
        <folder name="folder01">
            <file name="a.txt"/>
            <file name="b.txt"></file>
        </folder>
        <folder name="folder02">
            <file name="c.txt"/>
            <file name="d.txt" owner="bob">
            This is a comment about file d.
            We like comments.</file>
        </folder>
    </drive>
Looking at this document, you'll notice one <drive> tag with two <folder> tags inside it, and each <folder> tag contains two <file> tags. Together they form a tree-like structure of data. Now, we would like to access this data from within PHP. There are several ways to get to it, including parsing the text manually, using PHP's DOM support, or using the event-based (SAX-style) XML parser.
The manual option is not our most robust solution, and the DOM support in PHP is still experimental. So, I've chosen to use the SAX parser route. However, unlike other similar solutions, I'd like to write an object in PHP that parses this XML document into a PHP data structure so that I can access the data like any other PHP data instead of having to write a custom parser each time I use XML in an application.

Using An Array As A Data Structure

Knowing that the structure of the XML file is a tree, we need to find the best way to represent that “tree” data in PHP. Well, my first idea is to immediately consider a PHP array. Another option might be to build objects similar to the DOM parser approach. I've decided not to write a DOM parser, though (which you could easily do) because the DOM support is coming along quickly enough. Why duplicate their efforts?
For simple XML, PHP arrays are perfect for the task because you can create arrays of arrays of arrays and hence build a tree structure. Exactly what we need for this learning exercise. Besides, there already exists a plethora of functions built into the core PHP language for iterating through arrays, pushing, popping, shifting, unshifting, splitting, joining, slicing, etc.
To use the DOM model for inspiration, though, we'll need to store several pieces of information about a given XML tag. Each tag in XML will contain 4 pieces of information that we want to store:
  1. name of the tag,
  2. tag attributes (keys and values),
  3. data (the content inside the tag open and close),
  4. and possibly other nested tags.
A PHP array that can represent this simple XML tag (also referred to as a node in the tree) might look as follows:

<?php
    $node = array();
    $node['_NAME']      = 'folder';     // stores the node (tag) name
    $node['_DATA']      = 'content';    // stores the text content inside tags
    $node['_ELEMENTS']  = array();      // stores sub-nodes in order
    $node['key1']       = 'value1';     // stores all other node attributes
    $node['key2']       = 'value2';     // stores all other node attributes
    $node['key3']       = 'value3';     // stores all other node attributes
?>
What I've done here is create an array of key and value pairs for all the attributes in the node. Then, I've created 3 internal-use-only keys called '_NAME', '_DATA', and '_ELEMENTS' to store the tag name, tag data, and sub-node array. The leading underscore ('_') and upper case make a conflict with a real attribute name unlikely, although XML does permit attribute names that begin with an underscore, so this is a convention rather than a guarantee. Using the sub-node array, we can now create arrays of arrays of arrays and basically build our tree.
Using our XML example again, suppose you wanted to read in some information from the file where name is 'd.txt'... You'd first convert the XML into a PHP array of arrays and then access the data with code like the following:

<?php
    $file_name  = $data['drive'][0]['folder'][1]['file'][1]['name'];
    $owner      = $data['drive'][0]['folder'][1]['file'][1]['owner'];
    $comment    = $data['drive'][0]['folder'][1]['file'][1]['_DATA'];
?>
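The exact index path depends on how the array is laid out. With the '_NAME' / '_DATA' / '_ELEMENTS' node layout described earlier, the same lookup would step through '_ELEMENTS' at each level. The following is only a sketch: the tree is built by hand rather than parsed, and the helper make_node() is a made-up name for this example.

```php
<?php
// Hypothetical helper for this example only: builds one node in the
// layout described above (attributes plus the three reserved keys).
function make_node($name, array $attrs = array(), $data = '', array $elements = array())
{
    $node = $attrs;
    $node['_NAME']     = $name;
    $node['_DATA']     = $data;
    $node['_ELEMENTS'] = $elements;
    return $node;
}

// Hand-built copy of the sample document; a real tree would come from a parser.
$drive = make_node('drive', array('desc' => 'Letters and Numbers Harddrive'), '', array(
    make_node('folder', array('name' => 'folder01'), '', array(
        make_node('file', array('name' => 'a.txt')),
        make_node('file', array('name' => 'b.txt')),
    )),
    make_node('folder', array('name' => 'folder02'), '', array(
        make_node('file', array('name' => 'c.txt')),
        make_node('file', array('name' => 'd.txt', 'owner' => 'bob'),
            'This is a comment about file d.'),
    )),
));

// Second folder, second file: indexes count from zero.
$file      = $drive['_ELEMENTS'][1]['_ELEMENTS'][1];
$file_name = $file['name'];
$owner     = $file['owner'];
$comment   = $file['_DATA'];

printf("%s is owned by %s\n", $file_name, $owner);   // prints "d.txt is owned by bob"
```

Because sub-nodes are stored in document order, the numeric indexes mirror the position of each tag inside its parent.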

Make PHP Do The Hard Work

PHP has a built-in process for parsing your XML document. You pass a string containing XML text to the xml_parse function, and as the document is parsed, handlers for the configured events are called as many times as necessary. Some events for which you can write handlers are 'StartElement', 'EndElement', and 'CharacterData'. Here is some sample code for defining a class and the three event handlers to parse XML:

<?php

//######################################################################
class XMLToArray {

    var $parser;

    //----------------------------------------------------------------------
    /* Parse a text string containing valid XML into a multidim array. */
    function parse($xmlstring="") {
        // set up a new XML parser to do all the work for us
        $this->parser = xml_parser_create();
        xml_set_object($this->parser, $this);
        xml_parser_set_option($this->parser, XML_OPTION_CASE_FOLDING, false);
        xml_set_element_handler($this->parser, "startElement", "endElement");
        xml_set_character_data_handler($this->parser, "characterData");

        // parse the data and free the parser...
        xml_parse($this->parser, $xmlstring);
        xml_parser_free($this->parser);

        // ...
    }

    //----------------------------------------------------------------------
    function startElement($parser, $name, $attrs) {
        // Start a new Element.  This means we push the new element onto
        // the stack and reset its properties.
        printf("START: [%s]\n", $name);

        // ...
    }

    //----------------------------------------------------------------------
    function endElement($parser, $name) {
        // End an element.  This is done by popping the last element from
        // the stack and adding it to the previous element on the stack.
        printf("END: [%s]\n", $name);

        // ...
    }

    //----------------------------------------------------------------------
    function characterData($parser, $data) {
        // Collect the data onto the end of the current chars.
        printf("DATA: [%s]\n", str_replace("\n", "", $data));

        // ...
    }

    //----------------------------------------------------------------------
}

//######################################################################
?>
Once we've built this class to wrap the PHP parser, we can create an instance of the class and have it parse the XML sample code we described above. Some sample code to do this would look as follows:

<?php
$xml2a = new XMLToArray();
$xml2a->parse($xml_text);
?>

Watching The XML Parsing Events: Callback Functions

What do we expect to happen when the above code is executed? Well, each time the xml_parse function encounters an XML tag in our document, it'll fire an event by calling the functions we told it to call. A function registered this way is often referred to as a 'Callback Function'. Basically, we want PHP to call us back at a given function name each time it triggers an event of a certain type.
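To make the idea of a callback concrete without involving the XML parser at all, here is a minimal sketch. The class TagLogger and the function fire_start_events() are invented for this illustration, and the handler takes only a tag name, which is simpler than the arguments the real XML handlers receive.

```php
<?php
// Hypothetical class for this example; it stands in for a parser wrapper
// and simply records which tag names it was told about.
class TagLogger
{
    public $seen = array();

    public function startElement($name)
    {
        $this->seen[] = $name;
    }
}

// Hypothetical event source: it knows nothing about TagLogger and only
// calls whatever callback it was handed, once per tag name.
function fire_start_events(array $tag_names, $callback)
{
    foreach ($tag_names as $tag_name) {
        call_user_func($callback, $tag_name);
    }
}

$logger = new TagLogger();

// array($object, 'methodName') is PHP's way of naming a method as a callback.
fire_start_events(array('drive', 'folder', 'file'), array($logger, 'startElement'));

print(implode(', ', $logger->seen) . "\n");   // prints "drive, folder, file"
```

The code that fires the events and the code that reacts to them stay separate; the only thing they share is the callback that was handed over.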
By using the function, xml_set_element_handler, we are letting the PHP parser know that the open tag should invoke a method in our class named 'startElement' and all close tags should invoke a method in our class named 'endElement':

<?php
xml_set_element_handler($this->parser, "startElement", "endElement");
?>
Additionally, we want to capture all the character data between tags, so we use the method, xml_set_character_data_handler to define the callback function as 'characterData':

<?php
xml_set_character_data_handler($this->parser, "characterData");
?>
Callback functions are a very powerful tool that many languages offer and they work great in this specific case. Until I write an article on using callback functions, just accept that it 'simply works', and let's see what events are fired as we parse our sample XML document:
START: [drive]
DATA: []
DATA: [    ]
START: [folder]
DATA: []
DATA: [        ]
START: [file]
END: [file]
DATA: []
DATA: [        ]
START: [file]
END: [file]
DATA: []
DATA: [    ]
END: [folder]
DATA: []
DATA: [    ]
START: [folder]
DATA: []
DATA: [        ]
START: [file]
END: [file]
DATA: []
DATA: [        ]
START: [file]
DATA: []
DATA: [        This is a comment about file d.]
DATA: []
DATA: [        We like comments.]
END: [file]
DATA: []
DATA: [    ]
END: [folder]
DATA: []
END: [drive]
Did you expect to see that? Each time a tag is opened, a 'START: [tagname]' line is printed. When a tag is closed, we see an 'END: [tagname]' line. Finally, whenever data is encountered, we get 'DATA: [...]' lines. Notice, though, that the data lines do not necessarily arrive together. The parser does not guarantee that the text inside a tag is delivered in one chunk; in fact, it's likely that it will NOT be. The PHP parser is allowed to call your characterData callback method as many times as it needs to, and you'll have to concatenate the strings together until the end tag is reached.
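Here is one way the concatenation might be sketched, assuming PHP's XML extension is available. The class name ChunkCollector is made up for this example, and the handlers are registered as array($object, 'method') callbacks rather than by name. How many chunks the parser delivers is up to the parser, so the code only relies on the joined result.

```php
<?php
// Sketch: collect every character-data chunk for one element and join
// them at the end tag. The class name ChunkCollector is made up here.
class ChunkCollector
{
    public $chunks = array();   // each call to characterData adds one entry
    public $text   = '';        // joined text, filled in at the end tag

    public function startElement($parser, $name, $attrs)
    {
        // a new element starts with no collected text
        $this->chunks = array();
    }

    public function endElement($parser, $name)
    {
        // only now is it safe to treat the text as complete
        $this->text = implode('', $this->chunks);
    }

    public function characterData($parser, $data)
    {
        $this->chunks[] = $data;
    }
}

$collector = new ChunkCollector();
$parser    = xml_parser_create();
xml_set_element_handler(
    $parser,
    array($collector, 'startElement'),
    array($collector, 'endElement')
);
xml_set_character_data_handler($parser, array($collector, 'characterData'));

// One element whose text contains a line break and an entity.
xml_parse($parser, "<file>line one\nline two &amp; three</file>", true);

printf("chunks received: %d\n", count($collector->chunks));
printf("joined text:     %s\n", $collector->text);
```

Whatever the chunk count turns out to be, the joined text is the full content of the tag, with the entity already decoded to '&'.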

Building The Array Tree

At this point, we have a functioning class that will parse an XML document and fire events. What we'll need to do now is modify the event handlers to build our array tree using the array structure we defined above.
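A simple algorithm for building the tree, as reflected in the handler comments above, goes as follows:
  1. when a tag opens, create a new node and push it onto a stack,
  2. when character data arrives, append it to the node on top of the stack,
  3. when a tag closes, pop its node off the stack and add it to the sub-node list of the node now on top.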
Before we start parsing the XML, it might help to push a 'root' node onto an empty stack. This way, when the parsing is completed, we expect to find only the root node remaining on the stack with all the subnodes built beneath it.

The Completed XMLToArray Class


<?php
//######################################################################
class XMLToArray {

    //----------------------------------------------------------------------
    // private variables
    var $parser;
    var $node_stack = array();

    //----------------------------------------------------------------------
    /** PUBLIC
     * If a string is passed in, parse it right away.
     * (PHP 4-style constructor: a method with the same name as the class.)
     */
    function XMLToArray($xmlstring="") {
        if ($xmlstring) return($this->parse($xmlstring));
        return(true);
    }

    //----------------------------------------------------------------------
    /** PUBLIC
     * Parse a text string containing valid XML into a multidimensional array
     * located at rootnode.
     */
    function parse($xmlstring="") {
        // set up a new XML parser to do all the work for us
        $this->parser = xml_parser_create();
        xml_set_object($this->parser, $this);
        xml_parser_set_option($this->parser, XML_OPTION_CASE_FOLDING, false);
        xml_set_element_handler($this->parser, "startElement", "endElement");
        xml_set_character_data_handler($this->parser, "characterData");

        // Build a Root node and initialize the node_stack...
        $this->node_stack = array();
        $this->startElement(null, "root", array());

        // parse the data and free the parser...
        xml_parse($this->parser, $xmlstring);
        xml_parser_free($this->parser);

        // recover the root node from the node stack
        $rnode = array_pop($this->node_stack);

        // return the root node...
        return($rnode);
    }

    //----------------------------------------------------------------------
    /** PROTECTED
     * Start a new Element.  This means we push the new element onto the stack
     * and reset its properties.
     */
    function startElement($parser, $name, $attrs) {
        // create a new node...
        $node = array();
        $node["_NAME"]      = $name;
        foreach ($attrs as $key => $value) {
            $node[$key] = $value;
        }
        $node["_DATA"]      = "";
        $node["_ELEMENTS"]  = array();

        // add the new node to the end of the node stack
        array_push($this->node_stack, $node);
    }

    //----------------------------------------------------------------------
    /** PROTECTED
     * End an element.  This is done by popping the last element from the
     * stack and adding it to the previous element on the stack.
     */
    function endElement($parser, $name) {
        // pop this element off the node stack
        $node = array_pop($this->node_stack);
        $node["_DATA"] = trim($node["_DATA"]);

        // and add it as an element of the last node in the stack...
        $lastnode = count($this->node_stack);
        array_push($this->node_stack[$lastnode-1]["_ELEMENTS"], $node);
    }

    //----------------------------------------------------------------------
    /** PROTECTED
     * Collect the data onto the end of the current chars.
     */
    function characterData($parser, $data) {
        // add this data to the last node in the stack...
        $lastnode = count($this->node_stack);
        $this->node_stack[$lastnode-1]["_DATA"] .= $data;
    }

    //----------------------------------------------------------------------
}

//######################################################################
//##  END OF CLASS
//######################################################################
?>

How To Use The Class

The XMLToArray class we just built is rather simple in function. It parses your XML document into a multidimensional array. Here is some code that shows you how you might use this now:

<?php
require_once("XMLToArray.php");
$xml2a      = new XMLToArray();
$root_node  = $xml2a->parse($xml_text);
$drive      = array_shift($root_node["_ELEMENTS"]);

//print('<pre>'); print_r($drive); print('</pre>');

// print all the folders...
foreach ($drive["_ELEMENTS"] as $folder) {
    printf("FOLDER: %s\n", $folder["name"]);

    // print all the files in this folder
    foreach ($folder["_ELEMENTS"] as $file) {
        printf("\tFILE: %s\n", $file["name"]);
    }
}
?>
The output of this code would yield a display similar to the following:
FOLDER: folder01
	FILE: a.txt
	FILE: b.txt
FOLDER: folder02
	FILE: c.txt
	FILE: d.txt
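
The nested foreach loops above assume exactly two levels below the drive. For trees of unknown depth, a recursive walk over '_ELEMENTS' is one option. The sketch below uses a small hand-built tree instead of parser output, and describe_tree() is a hypothetical helper name, not part of the class.

```php
<?php
// Hypothetical helper: returns one line per node, indented by depth.
function describe_tree(array $node, $depth = 0)
{
    $label = isset($node['name']) ? $node['name'] : '(no name attribute)';
    $lines = array(str_repeat("\t", $depth) . strtoupper($node['_NAME']) . ': ' . $label);

    // recurse into the sub-nodes, one level deeper each time
    foreach ($node['_ELEMENTS'] as $child) {
        $lines = array_merge($lines, describe_tree($child, $depth + 1));
    }
    return $lines;
}

// Small hand-built tree in the same layout the class produces.
$folder = array(
    '_NAME' => 'folder', 'name' => 'folder02', '_DATA' => '',
    '_ELEMENTS' => array(
        array('_NAME' => 'file', 'name' => 'c.txt', '_DATA' => '', '_ELEMENTS' => array()),
        array('_NAME' => 'file', 'name' => 'd.txt', 'owner' => 'bob', '_DATA' => '', '_ELEMENTS' => array()),
    ),
);

$lines = describe_tree($folder);
print(implode("\n", $lines) . "\n");
```

Passing the whole drive node instead of a single folder would list every folder and file without any change to the function.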

In Summary

I use a version of this class to quickly import XML documents into a multidimensional PHP array where I can then use PHP functions to manipulate the array's contents. You might be able to speed this class up by using references on your stack, or you might optimize by building a Node class instead of using our simple array.
The real purpose of this article is not just to give you a working PHP class for XML, but rather to show you how you might develop your own XML parsing class and toolset. There exist other PHP resources like XPath that will allow you to search and extract values from XML documents more quickly than through this method. Additionally, as the DOM parser matures, you may find that it performs this parsing functionality for you but with C code which is many times faster. For simple XML needs, however, speed of execution is rarely the bottleneck for your application and this approach is sufficient and sometimes even a 'powerful' solution for getting the job done.
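For comparison, here is roughly what an XPath lookup might look like. This sketch assumes PHP's SimpleXML extension is installed, which this article does not otherwise use, and it runs against a slightly shortened copy of the sample document.

```php
<?php
// Sketch using PHP's SimpleXML extension (assumed to be installed).
$xml_text = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<drive desc="Letters and Numbers Harddrive">
    <folder name="folder01">
        <file name="a.txt"/>
        <file name="b.txt"></file>
    </folder>
    <folder name="folder02">
        <file name="c.txt"/>
        <file name="d.txt" owner="bob">This is a comment about file d.</file>
    </folder>
</drive>
XML;

$drive = simplexml_load_string($xml_text);

// XPath: every <file> element, at any depth, whose name attribute is d.txt.
$matches = $drive->xpath('//file[@name="d.txt"]');

$owner   = (string) $matches[0]['owner'];
$comment = trim((string) $matches[0]);

printf("owner: %s\ncomment: %s\n", $owner, $comment);
```

The query names what you are looking for rather than where it sits, so it keeps working if folders are reordered, which the index-based array access does not.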