Does It Taste As Good As It XMLs?
Dante Lorenso
In the last few years, XML has received great media attention, and most
languages support the parsing and extraction of data from XML documents.
Besides being a great three-letter acronym to sprinkle on your
résumé, XML is actually a useful data storage format
for PHP programmers.
Before you begin to use XML, you must first determine if your project
really needs what XML offers. There are alternative data storage formats
like fixed-width column files, tab-delimited files, CSV files, and database
tables, but these formats typically can only manage a simple grid of
rows and columns of data. XML provides several additional benefits for
programmers including:
- data format abstraction,
- simple document tag/data validation,
- the ability to store data in a tree-like hierarchy,
- platform independence,
- ease of integration,
- and more...
...but you already know that. What you want to do is use XML data inside
your sparkling new web application. We'll explore one simple way to do
this in the remainder of this article.
A Look At The XML File Structure
XML files can optionally be validated against a DTD, and they store data in a
format similar to something you'd see in an HTML document. Tag names are not
fixed by the language; you (or a DTD) define them, and nested tags can
represent a tree structure. Here is an example of some XML:
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is just a comment, ignore it -->
<drive desc="Letters and Numbers Harddrive">
    <folder name="folder01">
        <file name="a.txt"/>
        <file name="b.txt"></file>
    </folder>
    <folder name="folder02">
        <file name="c.txt"/>
        <file name="d.txt" owner="bob">
            This is a comment about file d.
            We like comments.</file>
    </folder>
</drive>
Now, if you look at this document, you'll notice that there is one
<drive> tag with two <folder> tags inside it, and each <folder> tag
contains two <file> tags. The document therefore forms a tree-like
structure of data. Now, we would like to access this data from within
PHP. There are many ways to do so, including:
- manually parsing the XML file,
- parsing the file with the PHP SAX parser,
- using the XPath libraries to search and pull data,
- or using the DOM parser
The manual option is not our most robust solution, and the DOM support in
PHP is still experimental at the time of writing. So, I've chosen to use the
SAX parser route. However, unlike other similar solutions, I'd like to write
a class in
PHP that parses this XML document into a PHP data structure so that I can
access the data like any other PHP data instead of having to write a custom
parser each time I use XML in an application.
Using An Array As A Data Structure
Knowing that the structure of the XML file is a tree, we need to find the best
way to represent that “tree” data in PHP. Well, my first idea
is to immediately consider a PHP array. Another option might be to build
objects similar to the DOM parser approach. I've decided not to write a
DOM parser, though (which you could easily do) because the DOM support is
coming along quickly enough. Why duplicate their efforts?
For simple XML, PHP arrays are perfect for the task because you can create
arrays of arrays of arrays and hence build a tree structure. Exactly
what we need for this learning exercise. Besides, there already exists a
plethora of functions built into the core PHP language for iterating
through arrays, pushing, popping, shifting, unshifting, splitting,
joining, slicing, etc.
To use the DOM model for inspiration, though, we'll need to store several
pieces of information about a given XML tag. Each tag in XML will contain
4 pieces of information that we want to store:
- name of the tag,
- tag attributes (keys and values),
- data (the content inside the tag open and close),
- and possibly other nested tags.
A PHP array that can represent this simple XML tag (also referred to as a
node in the tree) might look as follows:
<?php
$node = array();
$node['_NAME'] = 'folder'; // stores the node (tag) name
$node['_DATA'] = 'content'; // stores the text content inside tags
$node['_ELEMENTS'] = array(); // stores sub-nodes in order
$node['key1'] = 'value1'; // stores all other node attributes
$node['key2'] = 'value2'; // stores all other node attributes
$node['key3'] = 'value3'; // stores all other node attributes
?>
What I've done here is create an array of key and value pairs for all the
attributes in the node. Then, I've created 3 internal-use-only keys called
'_NAME', '_DATA', and '_ELEMENTS' to store the tag name, tag data, and
sub-node array. The underscore ('_') prefix makes a conflict with an
attribute name unlikely, although XML does allow attribute names that begin
with an underscore, so an attribute literally named '_NAME' would collide
with the reserved key. Using the sub-node array, we can now create arrays of
arrays of arrays and basically build our tree.
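To make the node layout concrete, here is a minimal sketch that builds one branch of the sample document by hand. The `make_node` helper is hypothetical and is not part of the class developed in this article; it only packages the array layout shown above.

```php
<?php
// Hypothetical helper (an assumption for illustration): build one node
// in the array layout described above.
function make_node($name, array $attrs = array(), $data = '') {
    $node = array('_NAME' => $name, '_DATA' => $data, '_ELEMENTS' => array());
    foreach ($attrs as $key => $value) {
        $node[$key] = $value;   // attributes sit next to the reserved keys
    }
    return $node;
}

// Hand-build the "folder02" branch of the sample document.
$folder = make_node('folder', array('name' => 'folder02'));
$folder['_ELEMENTS'][] = make_node('file', array('name' => 'c.txt'));
$folder['_ELEMENTS'][] = make_node(
    'file',
    array('name' => 'd.txt', 'owner' => 'bob'),
    'This is a comment about file d.'
);

// Walk one level down the tree.
foreach ($folder['_ELEMENTS'] as $file) {
    printf("%s contains %s\n", $folder['name'], $file['name']);
}
```

Because every node has the same shape, the same loop works at any depth of the tree.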
Using our XML example again, suppose you wanted to read in some information
from the file where name is 'd.txt'... You'd first convert the XML into a
PHP array of arrays and then access the data with code like the following:
<?php
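// $data is the array returned by the parser: a wrapper node whose first
// sub-node is <drive>. The drive's second sub-node is folder02, and that
// folder's second sub-node is d.txt.
$file = $data['_ELEMENTS'][0]['_ELEMENTS'][1]['_ELEMENTS'][1];
$file_name = $file['name'];
$owner = $file['owner'];
$comment = $file['_DATA'];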
?>
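Fixed positions are fragile when the document changes, so you might prefer to search for the file by its name. The following is a sketch only: `find_node` is a hypothetical helper, and the `$drive` array is a hand-written fragment in the node layout described earlier rather than real parser output.

```php
<?php
// Hypothetical helper: depth-first search for the first node with a given
// tag name and attribute value. Returns null when nothing matches.
function find_node(array $node, $tag, $attr, $value) {
    if ($node['_NAME'] === $tag && isset($node[$attr]) && $node[$attr] === $value) {
        return $node;
    }
    foreach ($node['_ELEMENTS'] as $child) {
        $found = find_node($child, $tag, $attr, $value);
        if ($found !== null) {
            return $found;
        }
    }
    return null;
}

// A hand-written fragment of the sample tree.
$drive = array(
    '_NAME' => 'drive', '_DATA' => '', 'desc' => 'Letters and Numbers Harddrive',
    '_ELEMENTS' => array(
        array(
            '_NAME' => 'folder', '_DATA' => '', 'name' => 'folder02',
            '_ELEMENTS' => array(
                array('_NAME' => 'file', '_DATA' => '', 'name' => 'c.txt',
                      '_ELEMENTS' => array()),
                array('_NAME' => 'file', '_DATA' => 'We like comments.',
                      'name' => 'd.txt', 'owner' => 'bob',
                      '_ELEMENTS' => array()),
            ),
        ),
    ),
);

$file = find_node($drive, 'file', 'name', 'd.txt');
printf("%s is owned by %s\n", $file['name'], $file['owner']);
```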
Make PHP Do The Hard Work
PHP has a built-in, event-driven process for parsing your XML document. You
pass a string of XML text to the xml_parse function, and as the document is
parsed, the handlers you registered for each event are called as many times
as necessary. The events for which we'll write handlers are 'StartElement',
'EndElement', and 'CharacterData'. Here is some sample code defining a class
and the three event handlers to parse XML:
<?php
//######################################################################
class XMLToArray {
    var $parser;
    //----------------------------------------------------------------------
    /* Parse a text string containing valid XML into a multidim array. */
    function parse($xmlstring="") {
        // set up a new XML parser to do all the work for us
        $this->parser = xml_parser_create();
        xml_set_object($this->parser, $this);
        xml_parser_set_option($this->parser, XML_OPTION_CASE_FOLDING, false);
        xml_set_element_handler($this->parser, "startElement", "endElement");
        xml_set_character_data_handler($this->parser, "characterData");
        // parse the data and free the parser...
        xml_parse($this->parser, $xmlstring);
        xml_parser_free($this->parser);
        // ...
    }
    //----------------------------------------------------------------------
    function startElement($parser, $name, $attrs) {
        // Start a new Element. This means we push the new element onto
        // the stack and reset its properties.
        printf("START: [%s]\n", $name);
        // ...
    }
    //----------------------------------------------------------------------
    function endElement($parser, $name) {
        // End an element. This is done by popping the last element from
        // the stack and adding it to the previous element on the stack.
        printf("END: [%s]\n", $name);
        // ...
    }
    //----------------------------------------------------------------------
    function characterData($parser, $data) {
        // Collect the data onto the end of the current chars.
        printf("DATA: [%s]\n", str_replace("\n", "", $data));
        // ...
    }
    //----------------------------------------------------------------------
}
//######################################################################
?>
Once we've built this class to wrap the PHP parser, we can create an instance
of the class and have it parse the XML sample code we described above. Some
sample code to do this would look as follows:
<?php
$xml2a = new XMLToArray();
$xml2a->parse($xml_text);
?>
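The class above ignores the return value of xml_parse, so malformed XML would fail silently. As a hedged sketch of how you might detect that, the snippet below feeds a deliberately broken string to a bare parser and reports the problem. The `$xml_text` value here is an assumption made up for the demonstration.

```php
<?php
// Assumption for illustration: a document whose </folder> tag is missing.
$xml_text = '<drive desc="demo"><folder name="folder01"></drive>';

$parser = xml_parser_create();

// xml_parse() returns 1 on success and 0 on failure. The third argument
// tells the parser that this is the final piece of the document.
$ok = xml_parse($parser, $xml_text, true);

$error = '';
if (!$ok) {
    $error = sprintf(
        "XML error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
    print($error . "\n");
}
```

The exact error text depends on the XML library your PHP build uses, so treat the message as informational rather than something to match against.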
Watching The XML Parsing Events: Callback Functions
What do we expect to happen when the above code is executed? Well, each time
the xml_parse function encounters an XML tag in our document, it'll fire an
event by calling the functions we told it to call. A function registered this
way is often referred to as a 'callback function'. Basically, we want PHP to
call us back at a given function name each time it triggers an event of a
certain type.
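If callbacks are new to you, the idea can be shown without any XML at all. In this sketch, `each_word` and `WordCollector` are hypothetical names invented for the example; `each_word` plays the role of the parser, and it knows nothing about what the handler does with each word.

```php
<?php
// Hypothetical mini "parser": it calls whatever handler it was given,
// once per word in the text.
function each_word($text, $handler) {
    foreach (explode(' ', $text) as $word) {
        call_user_func($handler, $word);
    }
}

// Hypothetical handler class that records what it was called with.
class WordCollector {
    public $seen = array();

    function onWord($word) {
        $this->seen[] = strtoupper($word);
    }
}

$collector = new WordCollector();

// array($object, 'methodName') identifies a method to call back.
each_word('drive folder file', array($collector, 'onWord'));

print(implode(',', $collector->seen) . "\n");   // prints DRIVE,FOLDER,FILE
```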
By calling xml_set_element_handler, we are
letting the PHP parser know that every open tag should invoke a method in
our class named 'startElement' and every close tag should invoke a method in
our class named 'endElement':
<?php
xml_set_element_handler($this->parser, "startElement", "endElement");
?>
Additionally, we want to capture all the character data between tags, so we
use the function xml_set_character_data_handler to define
the callback function as 'characterData':
<?php
xml_set_character_data_handler($this->parser, "characterData");
?>
Callback functions are a very powerful tool that many languages offer and
they work great in this specific case. Until I write an article on using
callback functions, just accept that it 'simply works', and let's see what
events are fired as we parse our sample XML document:
START: [drive]
DATA: []
DATA: [ ]
START: [folder]
DATA: []
DATA: [ ]
START: [file]
END: [file]
DATA: []
DATA: [ ]
START: [file]
END: [file]
DATA: []
DATA: [ ]
END: [folder]
DATA: []
DATA: [ ]
START: [folder]
DATA: []
DATA: [ ]
START: [file]
END: [file]
DATA: []
DATA: [ ]
START: [file]
DATA: []
DATA: [ This is a comment about file d.]
DATA: []
DATA: [ We like comments.]
END: [file]
DATA: []
DATA: [ ]
END: [folder]
DATA: []
END: [drive]
Did you expect to see that? Notice that each time a tag is opened, we
see the 'START: [tagname]' line printed. When a tag is closed,
we see the 'END: [tagname]' lines. Finally, whenever data is encountered,
we get the 'DATA: [...]' lines. Notice, though, that the data lines are not
necessarily together. A SAX-style parser does not guarantee that the text
inside a tag arrives in one chunk; in fact, it often will not. The PHP
parser is allowed to call your characterData callback method as many times
as it needs to, and you'll have to concatenate the strings until the end tag
is reached.
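The following sketch isolates that behavior. It records every chunk handed to the character data handler and also joins the chunks into one string. The handler is written as a closure for brevity, which is a choice made for this example and differs from the method-based handlers in the class.

```php
<?php
$xml = "<file name=\"d.txt\">\nThis is a comment about file d.\nWe like comments.</file>";

$chunks = array();   // every piece handed to the handler, in order
$text   = '';        // the pieces joined back together

$parser = xml_parser_create();
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$chunks, &$text) {
        $chunks[] = $data;
        $text    .= $data;
    }
);
xml_parse($parser, $xml, true);

// The number of calls can vary between XML libraries; the joined text
// does not.
printf("handler calls: %d\n", count($chunks));
printf("joined text: [%s]\n", trim($text));
```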
Building The Array Tree
At this point, we have a functioning class that will parse an XML document
and fire events. What we'll need to do now is modify the event handlers to
build our array tree using the array structure we defined above.
A simple algorithm for developing this code goes as follows:
- Each time a 'START' event is called, build a new node. The node we are
building is our array structure, but you might decide to build a Node
object using your own custom objects. The START event contains the
name and attributes of the tag. After building the node, push the new
node onto a class-level stack so that we can add data and sub-nodes from
the 'DATA' or 'END' events later.
- Each time a 'DATA' event is called, append the text to the end of the
text inside the node which is currently on top of our stack.
- Each time an 'END' event is called, we need to pop a node off the stack,
finalize the node, and add it as a subnode to the node which is now on
the top of the stack. This completes the depth-first build of our tree
structure.
Before we start parsing the XML, it might help to push a 'root' node onto
an empty stack. This way, when the parsing is completed, we expect to
find only the root node remaining on the stack with all the subnodes built
beneath it.
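Before wiring this into the parser, it can help to trace the algorithm against a hand-written list of events. The sketch below is an illustration only: the `$events` array is a made-up stand-in for what the parser would fire for `<folder name="f"><file name="a.txt">hi</file></folder>`.

```php
<?php
// Simulated event stream (an assumption standing in for the real parser).
$events = array(
    array('START', 'folder', array('name' => 'f')),
    array('START', 'file', array('name' => 'a.txt')),
    array('DATA', 'hi'),
    array('END', 'file'),
    array('END', 'folder'),
);

// Begin with a root node on the stack.
$stack = array(
    array('_NAME' => 'root', '_DATA' => '', '_ELEMENTS' => array()),
);

foreach ($events as $event) {
    switch ($event[0]) {
        case 'START':
            // Build a node from the tag name and attributes, then push it.
            // With the + operator the reserved keys on the left win.
            $stack[] = array(
                '_NAME' => $event[1], '_DATA' => '', '_ELEMENTS' => array(),
            ) + $event[2];
            break;
        case 'DATA':
            // Append text to the node currently on top of the stack.
            $stack[count($stack) - 1]['_DATA'] .= $event[1];
            break;
        case 'END':
            // Pop the finished node and attach it to its parent.
            $node = array_pop($stack);
            $stack[count($stack) - 1]['_ELEMENTS'][] = $node;
            break;
    }
}

$root = $stack[0];
$file = $root['_ELEMENTS'][0]['_ELEMENTS'][0];
printf("nodes left on stack: %d\n", count($stack));
printf("%s holds: %s\n", $file['name'], $file['_DATA']);
```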
The Completed XMLToArray Class
<?php
//######################################################################
class XMLToArray {
    //----------------------------------------------------------------------
    // private variables
    private $parser;
    private $node_stack = array();
    // root node from the most recent call to parse()
    public $root = null;
    //----------------------------------------------------------------------
    /** PUBLIC
     * If a string is passed in, parse it right away. A constructor cannot
     * return a value, so the result is kept in $this->root.
     */
    function __construct($xmlstring="") {
        if ($xmlstring) $this->parse($xmlstring);
    }
    //----------------------------------------------------------------------
    /** PUBLIC
     * Parse a text string containing valid XML into a multidimensional array
     * located at rootnode.
     */
    function parse($xmlstring="") {
        // set up a new XML parser to do all the work for us
        $this->parser = xml_parser_create();
        xml_set_object($this->parser, $this);
        xml_parser_set_option($this->parser, XML_OPTION_CASE_FOLDING, false);
        xml_set_element_handler($this->parser, "startElement", "endElement");
        xml_set_character_data_handler($this->parser, "characterData");
        // Build a Root node and initialize the node_stack...
        $this->node_stack = array();
        $this->startElement(null, "root", array());
        // parse the data and free the parser...
        xml_parse($this->parser, $xmlstring);
        xml_parser_free($this->parser);
        // recover the root node from the node stack
        $rnode = array_pop($this->node_stack);
        // remember the root node and return it...
        $this->root = $rnode;
        return($rnode);
    }
    //----------------------------------------------------------------------
    /** PROTECTED
     * Start a new Element. This means we push the new element onto the stack
     * and reset its properties.
     */
    function startElement($parser, $name, $attrs) {
        // create a new node...
        $node = array();
        $node["_NAME"] = $name;
        foreach ($attrs as $key => $value) {
            $node[$key] = $value;
        }
        $node["_DATA"] = "";
        $node["_ELEMENTS"] = array();
        // add the new node to the end of the node stack
        array_push($this->node_stack, $node);
    }
    //----------------------------------------------------------------------
    /** PROTECTED
     * End an element. This is done by popping the last element from the
     * stack and adding it to the previous element on the stack.
     */
    function endElement($parser, $name) {
        // pop this element off the node stack
        $node = array_pop($this->node_stack);
        $node["_DATA"] = trim($node["_DATA"]);
        // and add it as an element of the last node in the stack...
        $lastnode = count($this->node_stack);
        array_push($this->node_stack[$lastnode-1]["_ELEMENTS"], $node);
    }
    //----------------------------------------------------------------------
    /** PROTECTED
     * Collect the data onto the end of the current chars.
     */
    function characterData($parser, $data) {
        // add this data to the last node in the stack...
        $lastnode = count($this->node_stack);
        $this->node_stack[$lastnode-1]["_DATA"] .= $data;
    }
    //----------------------------------------------------------------------
}
//######################################################################
//## END OF CLASS
//######################################################################
?>
How To Use The Class
The XMLToArray class we just built is rather simple in function. It
parses your XML document into a multidimensional array. Here is some
code that shows you how you might use this now:
<?php
require_once("XMLToArray.php");
$xml2a = new XMLToArray();
$root_node = $xml2a->parse($xml_text);
$drive = array_shift($root_node["_ELEMENTS"]);
//print('<pre>'); print_r($drive); print('</pre>');
// print all the folders...
foreach ($drive["_ELEMENTS"] as $folder) {
    printf("FOLDER: %s\n", $folder["name"]);
    // print all the files in this folder
    foreach ($folder["_ELEMENTS"] as $file) {
        printf("\tFILE: %s\n", $file["name"]);
    }
}
?>
The output of this code would yield a display similar to the following:
FOLDER: folder01
    FILE: a.txt
    FILE: b.txt
FOLDER: folder02
    FILE: c.txt
    FILE: d.txt
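The nested foreach loops above assume exactly two levels below the drive. For trees of unknown depth, a recursive function is one option. In this sketch, `render_tree` is a hypothetical helper, and `$folder` is a hand-written node in the same layout the class produces.

```php
<?php
// Hypothetical helper: render a node and all of its descendants, one line
// per node, indented by one tab per level of depth.
function render_tree(array $node, $depth = 0) {
    $label = isset($node['name']) ? $node['name'] : '';
    $out = sprintf(
        "%s%s: %s\n",
        str_repeat("\t", $depth),
        strtoupper($node['_NAME']),
        $label
    );
    foreach ($node['_ELEMENTS'] as $child) {
        $out .= render_tree($child, $depth + 1);
    }
    return $out;
}

// Hand-written node standing in for real parser output.
$folder = array(
    '_NAME' => 'folder', '_DATA' => '', 'name' => 'folder01',
    '_ELEMENTS' => array(
        array('_NAME' => 'file', '_DATA' => '', 'name' => 'a.txt',
              '_ELEMENTS' => array()),
        array('_NAME' => 'file', '_DATA' => '', 'name' => 'b.txt',
              '_ELEMENTS' => array()),
    ),
);

print(render_tree($folder));
```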
In Summary
I use a version of this class to quickly import XML documents into a
multidimensional PHP array where I can then use PHP functions to manipulate
the array's contents. You might be able to enhance this class speed-wise with
the use of references on your stack, or you might optimize by building a
Node class instead of using our simple Array.
The real purpose of this article is not just to give you a working PHP
class for XML, but rather to show you how you might develop your own
XML parsing class and toolset. There exist other PHP resources like
XPath that will allow you to search and extract values from XML documents
more quickly than through this method. Additionally, as the DOM parser
matures, you may find that it performs this parsing for you in C code, which
is typically much faster. For simple XML needs, however, speed of execution
is rarely the bottleneck for your application, and this approach is
sufficient, and sometimes even a 'powerful' solution, for getting the job
done.
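For comparison, here is a short sketch of the XPath route mentioned above. It assumes your PHP build includes the DOM extension, which provides the DOMDocument and DOMXPath classes; the XML string is a trimmed copy of the sample document.

```php
<?php
$xml = '<drive>'
     . '<folder name="folder02">'
     . '<file name="c.txt"/>'
     . '<file name="d.txt" owner="bob"/>'
     . '</folder>'
     . '</drive>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Select every file element whose name attribute is "d.txt".
$matches = $xpath->query('//file[@name="d.txt"]');

$owner = null;
if ($matches->length > 0) {
    $owner = $matches->item(0)->getAttribute('owner');
}
printf("owner of d.txt: %s\n", $owner);
```

The query replaces the index arithmetic entirely, at the cost of depending on an extension rather than on plain arrays.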