Overview: An alternative to expat.
There are many xml tutorials for php on the web, but few show how to
parse xml using DOM. I would like to take this opportunity to show
there is an alternative to the widespread SAX implementation for php
programmers.
DOM (Document Object Model) and SAX (Simple API for XML) have
different philosophies on how to parse xml. The SAX engine is
event-driven. When it comes across a tag, it calls an
appropriate function to handle it. This makes SAX very fast and
efficient. However, because each handler sees only one piece of the
document at a time, your code has to keep track of where it is, and
you find yourself using many global variables and conditional
statements.
On the other hand, the DOM method is somewhat memory intensive. It
loads an entire xml document into memory as a hierarchy. The upside
is that all of the data is available to the programmer organized
much like a family tree. This approach is more intuitive,
easier to use, and affords better readability.
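To make the contrast concrete, here is a minimal sketch of the event-driven style using php's expat-based xml_* functions. The sample xml string and the handler names (start_tag, end_tag, text_data) are made up for illustration. Note how the handlers have to share state through globals just to know which tag the text belongs to.

```php
<?php
# hypothetical sample data, made up for this sketch
$xmlstr = "<employee><name>Matt</name><position>Web Guy</position></employee>";

# globals that remember where the parser currently is
$current_tag = "";
$found = array("NAME" => "", "POSITION" => "");

# called for every opening tag
# (tag names arrive in upper case because case folding is on by default)
function start_tag($parser, $name, $attrs)
{
    global $current_tag;
    $current_tag = $name;
}

# called for every closing tag
function end_tag($parser, $name)
{
    global $current_tag;
    $current_tag = "";
}

# called for every run of text; we have to check which tag we are in
function text_data($parser, $data)
{
    global $current_tag, $found;
    if ($current_tag == "NAME" || $current_tag == "POSITION") {
        $found[$current_tag] .= $data;
    }
}

$parser = xml_parser_create();
xml_set_element_handler($parser, "start_tag", "end_tag");
xml_set_character_data_handler($parser, "text_data");
xml_parse($parser, $xmlstr, true);

echo "name: " . $found["NAME"] . "\n";
echo "position: " . $found["POSITION"] . "\n";
```

Even for two fields, the logic is spread over three functions and two globals, which is the bookkeeping the DOM approach avoids.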
In order to use the DOM functions, you must configure php by specifying
the '--with-dom' argument. They are not a part of the standard
configuration. Here is a sample compilation.
%> ./configure --with-dom --with-apache=../apache_1.3.12
%> make
%> make install
How DOM structures XML
Because DOM loads an entire xml string or file into memory as a tree,
we can manipulate the data as a whole. To show what xml
looks like as a tree, take this xml document as an example.
<?xml version="1.0"?>
<book type="paperback">
<title>Red Nails</title>
<price>$12.99</price>
<author>
<name first="Robert" middle="E" last="Howard"/>
<birthdate>9/21/1977</birthdate>
</author>
</book>
The data would be structured like this.
DomNode book
|
|-->DomNode title
|   |
|   |-->DomNode text
|
|-->DomNode price
|   |
|   |-->DomNode text
|
|-->DomNode author
    |
    |-->DomNode name
    |
    |-->DomNode birthdate
        |
        |-->DomNode text
Any text enclosed within tags is really a node in itself. For instance,
"Red Nails" is a child node of title, and "$12.99" is a child node of
price.
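If you want to see this tree for yourself, the following sketch walks the sample document and builds an indented outline. It is only an illustration under one big assumption: it uses the DOMDocument class from php 5 and later, not the domxml objects covered in this article, and outline() is a hypothetical helper of my own naming. Setting preserveWhiteSpace to false drops the whitespace-only text nodes so the outline matches the simplified diagram above.

```php
<?php
$xmlstr = '<?xml version="1.0"?>
<book type="paperback">
<title>Red Nails</title>
<price>$12.99</price>
<author>
<name first="Robert" middle="E" last="Howard"/>
<birthdate>9/21/1977</birthdate>
</author>
</book>';

# return an indented outline of a node and everything below it
# (outline() is a hypothetical helper, not part of any library)
function outline($node, $depth = 0)
{
    $label = ($node->nodeType == XML_TEXT_NODE) ? "text" : $node->nodeName;
    $out = str_repeat("  ", $depth) . $label . "\n";
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $out .= outline($child, $depth + 1);
        }
    }
    return $out;
}

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;   # drop whitespace-only text nodes
$doc->loadXML($xmlstr);

echo outline($doc->documentElement);
```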
The Objects Used In DOM
At this point, you are probably wondering what a DomNode is. This is
a good place to start talking about the objects that are included in
the module. There are five objects defined by DOM: DomDocument,
DomNode, DomAttribute, DomDtd, and DomNamespace. We are going to
focus primarily on the DomDocument and DomNode objects because they
are the most useful.
The Node object
Here is an overview of what the DomNode object contains.
class DomNode
    properties:
        name
        content
        type
    methods:
        lastchild()
        children()
        parent()
        new_child( $name,$content )
        getattr( $name )
        setattr( $name,$value )
        attributes()
The properties need some elaboration.
- The name property is the actual tag name of the node. A node which
refers to the title tags would have the name of 'title'.
- The content property is usually empty. However, text nodes use this
property to hold text.
- The type property is a constant which defines exactly what kind of
object the node is. There can be several types of DomNode objects. A
list of constants is online at
http://www.php.net/manual/ref.domxml.php. For example, a DomNode
containing text would have a type of XML_TEXT_NODE.
The methods need to be explained, as well.
- lastchild() returns the last entry from a node's children.
- parent() returns a node's parent. For instance, the parent of our
title node would be 'book'.
- children() returns an array of a node's child nodes. For example, the
children of node author would be 'name' and 'birthdate'.
- new_child() takes a name and some content as arguments and adds a
new DomNode to its children.
- getattr() and setattr() both deal with attributes. One
fetches the value, the other sets it.
- attributes() returns an array of DomAttribute objects.
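For readers on php 5 or later, where the domxml objects are no longer part of php itself, roughly the same operations can be sketched with the DOMDocument and DOMElement classes. Treat this as an assumption about a newer API rather than a description of the module covered here. The comments show which DomNode method each line loosely corresponds to, and the "publisher" element and its value are invented for the example.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<book type="paperback"><title>Red Nails</title><price>$12.99</price></book>');
$book = $doc->documentElement;

echo $book->lastChild->nodeName . "\n";               # like lastchild()
echo $book->childNodes->length . "\n";                # like children()
echo $book->lastChild->parentNode->nodeName . "\n";   # like parent()

# like new_child($name, $content); "publisher" and its value are made up
$book->appendChild($doc->createElement("publisher", "Example Press"));

echo $book->getAttribute("type") . "\n";              # like getattr()
$book->setAttribute("type", "hardcover");             # like setattr()

# like attributes()
foreach ($book->attributes as $attr) {
    echo $attr->name . "=" . $attr->value . "\n";
}
```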
The DomDocument object
The DomDocument object represents the xml document as a whole.
class DomDocument
    properties:
        version
        encoding
        standalone
        type
    methods:
        root()
        children()
        add_root( $node )
        dtd()
        dumpmem()
The properties are pretty self-explanatory.
- 'version' refers to the xml version of the document.
- 'encoding' refers to the text encoding.
- 'standalone' is a boolean value determining whether the document is
standalone or not.
- The 'type' property has already been explained. A Document object will
most likely have the type of XML_DOCUMENT_NODE.
The methods are pretty simple too.
- root() returns the root node of a document. If we loaded our sample
xml file as a DomDocument object, the root node would refer to 'book'.
- children() works just as it did in DomNode.
- add_root() adds a new root node to the xml document. You would use
this if you wanted to supplant the 'book' node with another node.
- dtd() returns the xml document's dtd.
- dumpmem() returns a string representation of the xml data.
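As a point of comparison, here is how the document-level properties and methods might look with the DOMDocument class of php 5 and later. This is a sketch under the assumption that you are using that newer class, which is not the domxml module this article describes. The one-line sample document is made up.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<?xml version="1.0" encoding="UTF-8"?><book><title>Red Nails</title></book>');

echo $doc->xmlVersion . "\n";                  # like the 'version' property
echo $doc->encoding . "\n";                    # like the 'encoding' property
echo $doc->documentElement->nodeName . "\n";   # like root()
echo $doc->saveXML();                          # like dumpmem()
```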
The DomDocument Object Returned By xmltree()
xmltree(), a function which I haven't introduced yet, returns a
variant of the DomDocument object which may give you trouble. This
object has no methods, only properties in place of methods. The whole
document hangs off it as a nested tree of objects.
class DomDocument
    properties:
        version
        encoding
        standalone
        name
        content
        type
        attributes
        children
It is just as easy to use. For instance, instead of using a
method to get a node's children, just access its 'children' property.
'children' and 'attributes' are both arrays.
The Other Objects
I will list the other objects and their properties and methods just for
reference. We won't be dealing with them in this article.
class DomAttribute
    properties:
        name
        content
    methods:
        name()
class DomDtd
    properties:
        extid
        sysid
        name
class DomNamespace
Using the Objects
The DOM module has only three functions: xmldoc(), xmldocfile(), and
xmltree(). The rest of the time, we will be dealing with the objects.
All three functions return DomDocument objects. Here are examples of how
you load xml data into your php script:
<?php
# to load xml from a string
# use either of these
$doc = xmldoc( $xmlstr );
$tree = xmltree( $xmlstr );
# to load xml from a file
$doc = xmldocfile( $xmlfile );
?>
All three functions report an error if the xml cannot be parsed
correctly. DOM will not validate xml for you; you must find another
way of doing that, perhaps with a separate program such as xmllint.
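If you would rather handle a parse failure yourself than let php print a warning, something along these lines may help. It is a sketch that assumes the DOMDocument class and the libxml_* functions of php 5 and later, not the three domxml functions above. The broken sample string is invented.

```php
<?php
# collect parse errors instead of letting php print warnings
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$ok = $doc->loadXML("<employee><name>Matt</employee>");   # mismatched tags

if (!$ok) {
    foreach (libxml_get_errors() as $error) {
        echo "line " . $error->line . ": " . trim($error->message) . "\n";
    }
    libxml_clear_errors();
}
```

This only checks that the xml is well-formed. It does not validate the document against a dtd.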
A Simple Example
Let's start with a simple example to tie everything together.
<?php
# make an example xml document to play with
$xmlstr = "<?xml version=\"1.0\"?>";
$xmlstr .=
"
<employee>
<name>Matt</name>
<position type=\"contract\">Web Guy</position>
</employee>
";
# load xml data ($doc becomes an instance of
# the DomDocument object)
$doc = xmldoc($xmlstr);
# get root node "employee"
$employee = $doc->root();
# get employee's children ("name","position")
$nodes = $employee->children();
# let's play with the "position" node
# so we must iterate through employee's
# children in search of it
while ($node = array_shift($nodes))
{
    if ($node->name == "position")
    {
        $position = $node;
        break;
    }
}
# get position's type attribute
$type = $position->getattr("type");
# get the text enclosed by the position tag
# shift the first element off of position's children
# (array_shift() needs a real variable, not a function result)
$children = $position->children();
$text_node = array_shift($children);
# access the content property of the text node
$text = $text_node->content;
# echo out the position and type
echo "position: $text<BR>";
echo "type: $type";
?>
The example should print out the following:
position: Web Guy
type: contract
The while loop is essential for finding the position node. The
employee node really has five child nodes: three text, one name,
and one position. The text nodes contain the newlines at the end of
the lines. This may seem strange at first, but DOM treats any
string (even one containing only whitespace) as text and creates a
text node for it.
If you want to ensure that the employee node only has two child
nodes, you will have to write the xml entry on a single line, like this:
<employee><name>Matt</name><position type="contract">Web Guy</position></employee>
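You can check the difference in child counts with a few lines of code. This sketch assumes the DOMDocument class of php 5 and later, which by default keeps whitespace text nodes just as described here. count_children() is a hypothetical helper written for this example.

```php
<?php
# the same employee entry, once with newlines and once without
$spaced = "<employee>\n<name>Matt</name>\n<position type=\"contract\">Web Guy</position>\n</employee>";
$packed = "<employee><name>Matt</name><position type=\"contract\">Web Guy</position></employee>";

# count the direct children of the root node
# (count_children() is a hypothetical helper, not a library function)
function count_children($xmlstr)
{
    $doc = new DOMDocument();
    $doc->loadXML($xmlstr);
    return $doc->documentElement->childNodes->length;
}

echo count_children($spaced) . "\n";   # three text nodes plus name and position
echo count_children($packed) . "\n";   # only name and position
```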
A Longer Example
Here is a longer example of how to extract info from an xml doc. Suppose
we have a file called employees.xml containing employee entries.
<?xml version="1.0"?>
<employees company="zoomedia.com">
<employee>
<name>Matt</name>
<position type="contract">Web Guy</position>
</employee>
<employee>
<name>George</name>
<position type="full time">Mad Hacker</position>
</employee>
<employee>
<name>Wookie</name>
<position type="part time">Hairy SysAdmin</position>
</employee>
</employees>
Here's how you would extract this info in your php script.
<?php
# iterate through an array of nodes
# looking for a text node
# return its content
function get_content($parent)
{
    $nodes = $parent->children();
    while($node = array_shift($nodes))
        if ($node->type == XML_TEXT_NODE)
            return $node->content;
    return "";
}
# get the content of a particular node
function find_content($parent,$name)
{
    $nodes = $parent->children();
    while($node = array_shift($nodes))
        if ($node->name == $name)
            return get_content($node);
    return "";
}
# get an attribute from a particular node
function find_attr($parent,$name,$attr)
{
    $nodes = $parent->children();
    while($node = array_shift($nodes))
        if ($node->name == $name)
            return $node->getattr($attr);
    return "";
}
# load xml doc
$doc = xmldocfile("employees.xml") or die("What employees?");
# get root Node (employees)
$root = $doc->root();
# get an array of employees' children
# that is each employee node
$employees = $root->children();
# shift through the array
# and print out some employee data
while($employee = array_shift($employees))
{
    if ($employee->type == XML_TEXT_NODE)
        continue;
    $name = find_content($employee,"name");
    $pos = find_content($employee,"position");
    $type = find_attr($employee,"position","type");
    echo "$name the $pos, $type employee<br>";
}
?>
You should see the following in your browser.
Matt the Web Guy, contract employee
George the Mad Hacker, full time employee
Wookie the Hairy SysAdmin, part time employee
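For comparison, the same extraction can be sketched with the DOMDocument class of php 5 and later, where getElementsByTagName() replaces the hand-written search loops. This is an assumption about a newer API, not something the domxml module offers. describe_employees() is a hypothetical helper, and the xml is inlined as a string so the sketch does not depend on employees.xml being present.

```php
<?php
$xmlstr = '<?xml version="1.0"?>
<employees company="zoomedia.com">
<employee>
<name>Matt</name>
<position type="contract">Web Guy</position>
</employee>
<employee>
<name>George</name>
<position type="full time">Mad Hacker</position>
</employee>
<employee>
<name>Wookie</name>
<position type="part time">Hairy SysAdmin</position>
</employee>
</employees>';

# build one descriptive line per employee
# (describe_employees() is a hypothetical helper for this sketch)
function describe_employees($xmlstr)
{
    $doc = new DOMDocument();
    $doc->loadXML($xmlstr);
    $lines = array();
    foreach ($doc->getElementsByTagName("employee") as $employee) {
        $name = $employee->getElementsByTagName("name")->item(0);
        $pos  = $employee->getElementsByTagName("position")->item(0);
        $lines[] = $name->textContent . " the " . $pos->textContent
                 . ", " . $pos->getAttribute("type") . " employee";
    }
    return $lines;
}

foreach (describe_employees($xmlstr) as $line) {
    echo $line . "<br>\n";
}
```

Because the search is by tag name, the whitespace text nodes never get in the way and no type check is needed.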
Another example (adding data)
Since the xml is loaded into memory as a tree, we can easily
manipulate the data. We can add branches or nodes when necessary.
Say we want to add an employee to our xml file.
<?php
# quick function for making child nodes
function make_node($parent,$name,$content)
{
    # adds a new child node to parent node
    $parent->new_child($name,$content);
    # return the newly added child
    return $parent->lastchild();
}
# load xml file and get root node
$doc = xmldocfile("employees.xml") or die("Do you even have any employees?");
$root = $doc->root();
# add a new, empty employee node to the root
$newguy = make_node($root,"employee","");
# add the new guy's name
make_node($newguy,"name","New Guy");
# add his position
$position = make_node($newguy,"position","Backup Gnome");
# set the 'type' attribute
$position->setattr("type","intern");
# dump our altered xml doc to the browser
echo $doc->dumpmem();
?>
This prints the raw xml to the browser, which will not display the
tags, so you will most likely have to use 'View Source' in order to
see the data.
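Here is a rough equivalent of the same idea for php 5 and later, assuming the DOMDocument class rather than the domxml objects used in this article. add_employee() is a hypothetical helper, and the starting document is a shortened, inlined copy of employees.xml. Passing the output through htmlspecialchars() makes the tags visible in the browser without viewing the source.

```php
<?php
# append a new employee entry to the root node and return it
# (add_employee() is a hypothetical helper for this sketch)
function add_employee($doc, $name, $position, $type)
{
    $employee = $doc->createElement("employee");
    $employee->appendChild($doc->createElement("name", $name));

    $pos = $doc->createElement("position", $position);
    $pos->setAttribute("type", $type);
    $employee->appendChild($pos);

    $doc->documentElement->appendChild($employee);
    return $employee;
}

$doc = new DOMDocument();
$doc->loadXML('<employees company="zoomedia.com"><employee><name>Matt</name>'
            . '<position type="contract">Web Guy</position></employee></employees>');

add_employee($doc, "New Guy", "Backup Gnome", "intern");

# escape the markup so a browser shows the tags as text
echo htmlspecialchars($doc->saveXML());
```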
Conclusion
That's pretty much all there is to DOM xml. It's a simple approach to
parsing and manipulating xml in your scripts. I hope this article
sheds more light on this dusty corner of php.
-- Matt
References
php domxml source
php-4.0.2/ext/domxml.