By Matt Dunford

Overview: An alternative to expat.

There are many xml tutorials for php on the web, but few show how to parse xml using DOM. I would like to take this opportunity to show php programmers that there is an alternative to the widespread SAX implementation.
DOM (Document Object Model) and SAX (Simple API for XML) have different philosophies on how to parse xml. The SAX engine is event-driven: when it comes across a tag, it calls an appropriate function to handle it. Because it never holds more than the current piece of the document, SAX is very fast and efficient. However, writing the code can feel like being trapped inside an eternal loop, and you find yourself using many global variables and conditional statements to keep track of where you are.
On the other hand, the DOM method is somewhat memory intensive. It loads an entire xml document into memory as a hierarchy. The upside is that all of the data is available to the programmer organized much like a family tree. This approach is more intuitive, easier to use, and affords better readability.
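To make the contrast concrete, here is a rough sketch of the event-driven style, using php's expat-based xml_* functions. The handler names (start_tag, end_tag, char_data), the sax_titles() wrapper, and the global $state array are my own inventions for illustration. Treat it as a sketch of the pattern, not as the one right way to write a SAX parser.

```php
<?php
// Sketch only: event-driven parsing with php's xml_* (expat-style) functions.
// Handler names and the $state array are hypothetical, chosen for illustration.

$state = array('current' => '', 'titles' => array());

// called when the parser meets an opening tag
function start_tag($parser, $name, $attrs)
{
    global $state;
    $state['current'] = $name;          // remember which tag we are inside
}

// called when the parser meets a closing tag
function end_tag($parser, $name)
{
    global $state;
    $state['current'] = '';
}

// called for text between tags
function char_data($parser, $data)
{
    global $state;
    if ($state['current'] == 'title')   // decide what to do from global state
        $state['titles'][] = $data;
}

// collect the text of every <title> tag in an xml string
function sax_titles($xmlstr)
{
    global $state;
    $state = array('current' => '', 'titles' => array());

    $parser = xml_parser_create();
    // keep tag names as written instead of upper-casing them
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
    xml_set_element_handler($parser, 'start_tag', 'end_tag');
    xml_set_character_data_handler($parser, 'char_data');
    xml_parse($parser, $xmlstr, true);

    return $state['titles'];
}

print_r(sax_titles('<book><title>Red Nails</title><price>$12.99</price></book>'));
```

Notice that the handlers never see the document as a whole. They only know where they are because of what was stored in $state earlier, which is exactly the bookkeeping that DOM spares you.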
In order to use the DOM functions, you must configure php by specifying the '--with-dom' argument. They are not a part of the standard configuration. Here is a sample compilation.
%> ./configure --with-dom --with-apache=../apache_1.3.12
%> make
%> make install

How DOM structures XML

Because DOM loads an entire xml string or file into memory as a tree, we can manipulate the data as a whole. To show what xml looks like as a tree, take this xml document as an example.
<?xml version="1.0"?>

<book type="paperback">
	<title>Red Nails</title>
	<price>$12.99</price>
	<author>
		<name first="Robert" middle="E" last="Howard"/>
		<birthdate>9/21/1977</birthdate>
	</author>
</book>
The data would be structured like this.
DomNode book
	|
	|-->DomNode title
	|		|
	|		|-->DomNode text
	|
	|-->DomNode price
	|		|
	|		|-->DomNode text
	|
	|-->DomNode author
			|
			|-->DomNode name
			|
			|-->DomNode birthdate
					|
					|-->DomNode text
Any text enclosed within tags is really a node in itself. For instance, "Red Nails" is a child node of title, and "$12.99" is a child node of price.
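If you want to see this for yourself, the following sketch walks a small document and lists every node it finds, text nodes included. Be aware that it assumes the DOMDocument class that ships with php 5 and later, not the domxml objects covered in this article, and that describe_tree() is a hypothetical helper of my own.

```php
<?php
// Sketch only: uses the DOMDocument class from php 5 and later,
// not the domxml objects described in this article.
// describe_tree() is a hypothetical helper name.

// return one indented line per node, depth first
function describe_tree($node, $depth = 0)
{
    $indent = str_repeat("  ", $depth);
    $lines  = array();

    if ($node->nodeType == XML_TEXT_NODE)
        $lines[] = $indent . 'text "' . $node->nodeValue . '"';
    else
        $lines[] = $indent . $node->nodeName;

    // text nodes simply have an empty list of children
    foreach ($node->childNodes as $child)
        $lines = array_merge($lines, describe_tree($child, $depth + 1));

    return $lines;
}

$doc = new DOMDocument();
$doc->loadXML('<book type="paperback"><title>Red Nails</title><price>$12.99</price></book>');

echo implode("\n", describe_tree($doc->documentElement)), "\n";
```

The title and price lines are each followed by an indented text line, which is the same parent and child relationship drawn in the diagram above.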

The Objects Used In DOM

At this point, you are probably wondering what a DomNode is. This is a good place to start talking about the objects that are included in the module. There are five objects defined by DOM: DomDocument, DomNode, DomAttribute, DomDtd, and DomNamespace. We are going to focus primarily on the DomDocument and DomNode objects because they are the most useful.

The Node object

Here is an overview of what the DomNode object contains.
class DomNode
	properties:
		name
		content
		type
	methods:
		lastchild() 
		children() 
		parent() 
		new_child( $name,$content ) 
		getattr( $name ) 
		setattr( $name,$value ) 
		attributes() 
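The properties need some elaboration. 'name' holds the name of the node, which for a tag is the tag name (e.g. "title"). 'content' holds the text stored in a text node. 'type' holds a constant identifying the kind of node, such as XML_TEXT_NODE for text.
The methods need to be explained, as well. children() returns an array of a node's child nodes, lastchild() returns the last of them, and parent() returns the node one level up. new_child() adds a child node with the given name and content. getattr() and setattr() read and set a single attribute, and attributes() returns all of a node's attributes.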

The DomDocument object

The DomDocument object is also important.
class DomDocument
	properties:
		version 
		encoding 
		standalone
		type
	methods:
		root() 
		children() 
		add_root( $node ) 
		dtd() 
		dumpmem() 
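The properties are pretty self-explanatory: version, encoding, and standalone mirror the fields of the xml declaration.
The methods are pretty simple too. The two used in this article are root(), which returns the document's root node, and dumpmem(), which returns the whole document as an xml string.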

The DomDocument Object Returned By xmltree()

xmltree(), a function which I haven't introduced yet, returns a different kind of DomDocument object which may give you trouble. This object has no methods, only properties in their place, so the whole document is available as a true tree of nested objects.
class DomDocument
	properties:
		version
		encoding
		standalone
		name
		content
		type
		attributes
		children
It is just as easy to use. For instance, instead of using a method to get a node's children, just access its 'children' property. 'children' and 'attributes' are both arrays.

The Other Objects

I will list the other objects and their properties and methods just for reference. We won't be dealing with them in this article.
class Attribute
	properties:
		name
		content
	methods:
		name()

class Dtd
	properties:
		extid
		sysid
		name

class Namespace

Using the Objects

The DOM module only has three functions, xmldoc(), xmldocfile(), and xmltree(). The rest of the time, we will be dealing with the objects. All functions return DomDocument objects. Here are examples of how you load xml data into your php script:

<?php

# to load xml from a string
# use either of these
$doc = xmldoc( $xmlstr );
$tree = xmltree( $xmlstr );

# to load xml from a file
$doc = xmldocfile( $xmlfile );

?>
All of these functions will throw an error if the xml cannot be parsed correctly, that is, if it is not well-formed. DOM will not validate xml for you, however. You must find another way of doing that, perhaps through another program like xmllint.
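The difference between "can be parsed" and "is valid" is easy to blur, so here is a small sketch of a parse check on its own. It assumes the DOMDocument class and libxml_* functions found in php 5 and later rather than the domxml functions above, and is_well_formed() is a hypothetical helper name.

```php
<?php
// Sketch only: DOMDocument and the libxml_* functions come with php 5 and later.
// is_well_formed() is a hypothetical helper name.

// return true if the string parses as xml, false otherwise
// (expects a non-empty string)
function is_well_formed($xmlstr)
{
    // collect parse errors quietly instead of emitting warnings
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok  = $doc->loadXML($xmlstr);

    // throw away the collected errors and restore the old setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok !== false;
}

var_dump(is_well_formed('<employee><name>Matt</name></employee>'));
var_dump(is_well_formed('<employee><name>Matt</employee>'));
```

The second string fails because the name tag is never closed. A document that passes this check can still break the rules of its DTD, which is the part you need a separate validator for.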

A Simple Example

Let's start with a simple example to tie everything together.

<?php

# make an example xml document to play with
$xmlstr = "<" . "?" . "xml version=\"1.0\"" . "?" . ">";
$xmlstr .=
"
<employee>
    <name>Matt</name>
    <position type=\"contract\">Web Guy</position>
</employee>
"
;

# load xml data ($doc becomes an instance of 
# the DomDocument object)
$doc = xmldoc($xmlstr);

# get root node "employee"
$employee = $doc->root();

# get employee's children ("name","position")
$nodes = $employee->children();

# let's play with the "position" node
# so we must iterate through employee's
# children in search of it
while ($node = array_shift($nodes))
{
    if ($node->name == "position")
    {
        $position = $node;
        break;
    }
}

# get position's type attribute
$type = $position->getattr("type");

# get the text enclosed by the position tag
# shift the first element off of position's children
# (array_shift works on a variable, so store the children first)
$children = $position->children();
$text_node = array_shift($children);

# access the content property of the text node
$text = $text_node->content;

# echo out the position and type
echo "position: $text<BR>";
echo "type: $type";

?>
The example should print out the following:
position: Web Guy
type: contract
The while loop is essential for finding the position node. The employee node really has five child nodes: three text, one name, and one position. The text nodes contain the newlines and indentation between the tags. This may seem strange at first, but DOM considers any string (even one containing only whitespace) to be text and makes an appropriate node for it.
If you want to ensure that the employee node only has two child nodes, you will have to write the xml entry like this:
<employee><name>Matt</name><position type="contract">Web Guy</position></employee>
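You can check the child counts yourself with a few lines of code. This sketch again assumes the DOMDocument class from php 5 and later (with its default setting of keeping whitespace), not the domxml functions used in this article, and count_children() is a hypothetical helper.

```php
<?php
// Sketch only: uses DOMDocument from php 5 and later with its default
// whitespace handling. count_children() is a hypothetical helper name.

// return the number of child nodes under the root node
function count_children($xmlstr)
{
    $doc = new DOMDocument();
    $doc->loadXML($xmlstr);
    return $doc->documentElement->childNodes->length;
}

$indented = "<employee>\n    <name>Matt</name>\n    <position type=\"contract\">Web Guy</position>\n</employee>";
$compact  = "<employee><name>Matt</name><position type=\"contract\">Web Guy</position></employee>";

echo count_children($indented), "\n";   // 5: text, name, text, position, text
echo count_children($compact), "\n";    // 2: name, position
```

Squeezing the whitespace out of your xml is rarely practical, which is why the examples in this article skip text nodes in code instead.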

A Longer Example

Here is a longer example of how to extract info from an xml doc. Suppose we have a file called employees.xml containing employee entries.
<?xml version="1.0"?>

<employees company="zoomedia.com">
	<employee>
		<name>Matt</name>
		<position type="contract">Web Guy</position>
	</employee>

	<employee>
		<name>George</name>
		<position type="full time">Mad Hacker</position>
	</employee>

	<employee>
		<name>Wookie</name>
		<position type="part time">Hairy SysAdmin</position>
	</employee>
</employees>
Here's how you would extract this info in your php script.

<?php

# iterate through an array of nodes
# looking for a text node
# return its content
function get_content($parent)
{
    $nodes = $parent->children();
    while ($node = array_shift($nodes))
        if ($node->type == XML_TEXT_NODE)
            return $node->content;
    return "";
}

# get the content of a particular node
function find_content($parent,$name)
{
    $nodes = $parent->children();
    while ($node = array_shift($nodes))
        if ($node->name == $name)
            return get_content($node);
    return "";
}

# get an attribute from a particular node
function find_attr($parent,$name,$attr)
{
    $nodes = $parent->children();
    while ($node = array_shift($nodes))
        if ($node->name == $name)
            return $node->getattr($attr);
    return "";
}

# load xml doc
$doc = xmldocfile("employees.xml") or die("What employees?");

# get root Node (employees)
$root = $doc->root();

# get an array of employees' children
# that is each employee node
$employees = $root->children();

# shift through the array 
# and print out some employee data
while ($employee = array_shift($employees))
{
    if ($employee->type == XML_TEXT_NODE)
        continue;

    $name = find_content($employee,"name");
    $pos = find_content($employee,"position");
    $type = find_attr($employee,"position","type");

    echo "$name the $pos, $type employee<br>";
}

?>
You should see the following in your browser.
Matt the Web Guy, contract employee
George the Mad Hacker, full time employee
Wookie the Hairy SysAdmin, part time employee

Another example (adding data)

Since the xml is loaded into memory as a tree, we can easily manipulate the data. We can add branches or nodes when necessary.
Say we want to add an employee to our xml file.

<?php

# quick function for making child nodes
function make_node($parent,$name,$content)
{
    # adds a new child node to parent node
    $parent->new_child($name,$content);

    # return the newly added child as a reference
    return $parent->lastchild();
}

# load xml file and get root node
$doc = xmldocfile("employees.xml") or die("Do you even have any employees?");
$root = $doc->root();

# add an empty employee node for the new guy
$newguy = make_node($root,"employee","");

# add the new guy's name
make_node($newguy,"name","New Guy");

# add his position
$position = make_node($newguy,"position","Backup Gnome");

# set the 'type' attribute
$position->setattr("type","intern");

# dump our altered xml doc to the browser
echo $doc->dumpmem();

?>
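This will print the xml to the browser, so you will most likely have to 'View the Source' in order to see the data. Note that dumpmem() only returns the altered document as a string. The employees.xml file itself is not changed unless you write that string back to it.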
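For readers on php 5 or later, where the domxml functions shown here are not part of the standard distribution, the same addition can be sketched with the bundled DOMDocument class. This is only my rough translation of the example above, not a drop-in replacement, and add_employee() is a hypothetical helper name.

```php
<?php
// Sketch only: a rough DOMDocument (php 5 and later) version of the
// example above. add_employee() is a hypothetical helper name.

// append an employee entry to the root node and return the new xml string
function add_employee($xmlstr, $name, $position, $type)
{
    $doc = new DOMDocument();
    $doc->loadXML($xmlstr);

    // build the new branch: employee -> name, position
    $employee = $doc->createElement('employee');
    $employee->appendChild($doc->createElement('name', $name));

    $pos = $doc->createElement('position', $position);
    $pos->setAttribute('type', $type);
    $employee->appendChild($pos);

    // hang the branch on the root node and dump the document
    $doc->documentElement->appendChild($employee);
    return $doc->saveXML();
}

echo add_employee('<employees company="zoomedia.com"/>', 'New Guy', 'Backup Gnome', 'intern');
```

The shape of the code is the same as before: build the nodes, attach them to the tree, then dump the tree as a string.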

Conclusion

That's pretty much all there is to DOM xml. It's a simple approach to parsing and manipulating xml in your scripts. I hope this article will shed more light on this dusty corner of php.
-- Matt

References

domxml functions for php
http://www.php.net/manual/ref.domxml.php
DOM reference
http://www.w3.org/TR/
libxml, an essential library for building dom
ftp://ftp.gnome.org/pub/GNOME/stable/sources/libxml/
Short intro on the difference between DOM and SAX
http://www.builder.com/Programming/XMLToday/ss01.html?tag=st.bl.7267.dir1.XMLToday_01
php domxml source
php-4.0.2/ext/domxml.