By Justin Grant
The Extensible Markup Language (XML) is definitely something that most developers will want to add to their toolbox. XML is a W3C standard that is open, language-neutral, API-neutral, streamable, textual and human-readable, and it is a way of bringing structured data to the web. XML is a subset of SGML. It is not a markup language in itself; rather, it lets authors define their own markup languages, and it is best suited to representing hierarchical data. Parsing XML documents with PHP is not a topic that has been covered extensively, as far as I have seen on the web and elsewhere. There is some very useful information in the PHP manual on the XML parsing functions, but that seems to be all the information I could find. Other languages seem to have much more information and many more working examples of using XML than PHP does, so in this article I will attempt to do my part to change this.
I will walk the reader through a fairly simple application of XML that I used to implement a news system for my home page. I actually use this on my web site and it works really well. Feel free to use it on yours if you like. OK, let's get started!
To use the XML parsing functions available to PHP you will need a PHP module with XML support installed on your web server. This means you will probably have to recompile your module to support XML; please see the PHP installation documentation for more information on how to do this. The XML parsing functions are really wrappers around the expat parser, which is a SAX-style parser (SAX stands for Simple API for XML). The other kind of parser is a DOM parser, which is easier to use; an example is Microsoft's MSXML component, which lets the programmer navigate through a tree-style object, accessing nodes and elements. With the expat parser (or any SAX parser) you parse an XML document by specifying callback functions for the different kinds of content in the document being parsed. When the parser works through your XML document and encounters a tag, it calls your function, which processes the tag, before continuing. You could look at this as an event-driven way of parsing.
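To make the event-driven idea concrete, here is a minimal sketch, assuming a reasonably recent PHP with the XML extension enabled. The $events array and the anonymous handler functions are names I made up for this illustration; they are not part of the NewsBoy class.

```php
<?php
// Record every event the parser reports, in the order it reports them.
$events = [];

$parser = xml_parser_create();

// Called for each opening and each closing tag.
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start:$name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "end:$name";
    }
);

// Called for the text found between tags.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$events) {
    $events[] = "text:$data";
});

// The third argument tells the parser this is the last (here, the only) chunk.
$ok = xml_parse($parser, '<story><slug>Hello</slug></story>', true);

foreach ($events as $event) {
    print $event . "\n";
}
```

The tag names arrive in upper case because case folding is switched on by default. Nothing is built in memory for you; whatever you want to keep, your callbacks have to store themselves.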
Let's look at the XML document that is parsed by the 'Newsboy' class.

mynews.xml

<?xml version="1.0" standalone="no"?>
<!DOCTYPE NewsBoy SYSTEM "NewsBoy.dtd">

<NewsBoy>

<story>
	<date>03/31/2000</date>
	<slug>Sooo Busy !</slug>
	<text>
	I haven't posted anything here for a while now as I have been busy with work (have to pay those bills!). <newline></newline>
	I have just finished a neat little script that stores a complete record set in a session variable after <newline></newline>
	doing an SQL query. The neat part is that an XML doc is stored in the session variable an when paging <newline></newline>
	through the results (often near 1000!) the script displays 50 results at a time from the XML doc in the <newline></newline>
	session variable instead of doing another query against the database. It takes a BIG load off of the <newline></newline>
	database server.
	</text>
</story>

<story>
	<date>03/25/2000</date>
	<slug>NewsBoy Class</slug>
	<text>
	Converted Newsboy to a PHP class to allow better abstraction (as far as PHP allows.)   <newline></newline>
	Guess that means this is version 0.02 ?!					       <newline></newline>
	Newsboy will have a section of it's own soon on how to use and customize the class.    <newline></newline>
	</text>
</story>

<story>
	<date>03/24/2000</date>
	<slug>NewsBoy is up!</slug>
	<text>
	I have just finished NewsBoy v0.01 !!! <newline></newline>
	It looks quite promising. You may ask, &quot;What the heck is it?!&quot;. <newline></newline>
	Well it's a simple news system for web-sites, written in PHP, that makes use of XML for <newline></newline>
	the news data format allowing easy updating and portability between platforms.
	It uses the built in expat parser for Apache.
	This is just the very first version and there will be loads of improvements as the project progresses.
	</text>
</story>

<story>
	<date>03/24/2000</date>
	<slug>Romeo must Die</slug>
	<text>
	Saw a really cool movie today at Mann called 'Romeo must Die' <newline></newline>
	Nice fight scenes for a typical kung-fu movie with some 'Matrix' style effects. <newline></newline>
	One particular cool effect was the 'X-Ray Vision' effect that occurred in various fight scenes. <newline></newline>
	The hero, played by Jet Li, strikes a bad guy and you can see the bone in his arm crack, in X-RAY vision. <newline></newline>
	There were some funny scenes too when Jet has to play American football with the bad guys. <newline></newline>
	The official website for the movie is &lt;A HREF='http://www.romeo-must-die.com'&gt; here &lt;/A&gt; <newline></newline>
	<newline></newline>
	</text>
	&lt;IMG SRC=&quot;http://a1996.g.akamaitech.net/7/1996/25/e586077a88e7a4/romeomustdie.net/images/image15.jpg&quot; WIDTH=300 &gt;
</story>

</NewsBoy>

OK, if you aren't familiar with XML documents this might look a little scary, but it really isn't. The first line is the XML declaration. The 'version' attribute tells the parser that the document conforms to the XML 1.0 standard as defined by the W3C. The 'standalone' attribute, set to "no" here, tells the program reading the XML that the document may depend on declarations held outside the document itself, in this case the DTD in the separate file 'NewsBoy.dtd' named by the DOCTYPE declaration on the next line (the DTD could also be placed inside the same file if you wanted). The DOCTYPE declaration names the root element of the XML document, which in this case is the 'NewsBoy' element, and gives the location of the DTD, which sits in the same folder as the XML file itself. Please note that I am not validating the XML document against the DTD I have created, because expat is a non-validating parser: it checks that a document is well-formed but not that it obeys its DTD. According to James Clark, the author of expat, this is in line with the W3C's XML specification, which does not require a parser to validate a document; the programmer has to deal with that.
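The difference between a well-formed document and a valid one is easy to see in a small experiment. The following is only a sketch, assuming the XML extension is enabled: wellFormedError() is a hypothetical helper, and the tiny inline DTD is invented for the illustration rather than taken from NewsBoy.dtd.

```php
<?php
// Returns null when the parser accepts the document,
// or the parser's error message when it does not.
function wellFormedError(string $xml): ?string {
    $parser = xml_parser_create();
    if (xml_parse($parser, $xml, true)) {
        return null;
    }
    return xml_error_string(xml_get_error_code($parser));
}

// The DTD says NewsBoy may only contain story elements,
// yet the document uses an undeclared <unknown/> element.
$invalidButWellFormed = '<?xml version="1.0"?>'
    . '<!DOCTYPE NewsBoy [<!ELEMENT NewsBoy (story*)> <!ELEMENT story (#PCDATA)>]>'
    . '<NewsBoy><unknown/></NewsBoy>';

// Here <story> is never closed, so the document is not even well-formed.
$malformed = '<NewsBoy><story></NewsBoy>';

var_dump(wellFormedError($invalidButWellFormed)); // accepted: the DTD is not enforced
var_dump(wellFormedError($malformed));            // rejected with an error message
```

If you need the DTD rules enforced, that check has to come from somewhere other than these parsing functions.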
The remainder of the document includes stories that are built with tags that I have defined in my NewsBoy class.
Let's look at the code for the PHP class that actually parses this document.

<?php

/*

NewsBoy:     a News system for the web written in PHP by Justin Grant
    (Web: jusgrant.cjb.net or justin.host.za.net Mail: justin@glendale.net)

25 March    V0.0.2    Converted Newsboy to a PHP class, allowing the layout to be easily modified.
            Also made the HTML that is generated a little easier to read.

24 March     V0.0.1    Just completed the initial version, very rough and basic.

*/

class newsboy {

    public $xml_parser;
    public $xml_file;
    public $html;
    public $open_tag;
    public $close_tag;

    // Class Constructor
    function __construct() {
        $this->xml_parser = "";
        $this->xml_file = "";
        $this->html = "";

        // HTML emitted when a tag opens.
        // These are the default settings but they are quite easy to modify.
        $this->open_tag = array(
            "NEWSBOY" => "\n<!-- XML Starts here -->\n<TABLE COLS=1 CELLPADDING=5>",
            "STORY"   => "<TR><TD BGCOLOR=#222222>",
            "DATE"    => "<FONT COLOR=#BBBB00>",
            "SLUG"    => "<FONT COLOR=#00AACC><B>&nbsp;&nbsp;",
            "TEXT"    => "<FONT COLOR=#AAAAAA>",
            "PIC"     => "",
            "NEWLINE" => ""
        );

        // HTML emitted when a tag closes.
        $this->close_tag = array(
            "NEWSBOY" => "</TABLE>\n<!-- XML Ends here -->\n\n",
            "STORY"   => "</TD></TR>",
            "DATE"    => "</FONT>",
            "SLUG"    => "</B></FONT><BR>",
            "TEXT"    => "</FONT>\n",
            "PIC"     => "<IMAGE SRC=",
            "NEWLINE" => "<BR>"
        );
    }

    // Class Destructor (has to be invoked manually; it is not called automatically)
    function destroy() {
        xml_parser_free($this->xml_parser);
    }

    // Class Members
    function concat($str) {
        $this->html .= $str;
    }

    // Called on every opening tag: append the matching HTML, if any is defined
    function startElement($parser, $name, $attrs) {
        if (!empty($this->open_tag[$name])) {
            $this->html .= $this->open_tag[$name];
        }
    }

    // Called on every closing tag: append the matching HTML, if any is defined
    function endElement($parser, $name) {
        if (!empty($this->close_tag[$name])) {
            $this->html .= $this->close_tag[$name];
        }
    }

    // Called for the text between tags: copy it through unchanged
    function characterData($parser, $data) {
        $this->html .= $data;
    }

/*
    function PIHandler($parser, $target, $data) {
        //switch (strtolower($target)){
        //    case "php":
                 eval($data);
        //    break;
        //}
    }
*/

    function parse() {
        $this->xml_parser = xml_parser_create();
        // the callbacks registered below are methods of this object
        xml_set_object($this->xml_parser, $this);
        // use case-folding so we are sure to find the tag in the open_tag and close_tag arrays
        xml_parser_set_option($this->xml_parser, XML_OPTION_CASE_FOLDING, true);
        xml_set_element_handler($this->xml_parser, "startElement", "endElement");
        xml_set_character_data_handler($this->xml_parser, "characterData");
        //xml_set_processing_instruction_handler($this->xml_parser, "PIHandler");

        if (!($fp = fopen($this->xml_file, "r"))) {
            die("could not open XML input");
        }

        // read and parse the file one 4096-byte chunk at a time
        while ($data = fread($fp, 4096)) {
            if (!xml_parse($this->xml_parser, $data, feof($fp))) {
                die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($this->xml_parser)),
                    xml_get_current_line_number($this->xml_parser)));
            }
        }

        fclose($fp);
    }

}

?>

In the class constructor I have defined two associative arrays, one for opening tags and one for closing tags. The keys have the same names as the tags that I am going to parse later, and the corresponding values contain the HTML used for formatting when an opening or closing tag is encountered.
I have defined a simple destructor method, destroy(), to free the XML parser when we no longer require it. It has to be called manually; PHP does not call it automatically when the object is destroyed.
I then define the main callback methods for handling the opening and closing tags in the XML document. I also define a character data method that handles the data between the opening and closing tags, if there is any. Later on I will show you how to register these callback methods with the parser.
In startElement and endElement (the methods called on an opening and a closing tag, respectively) the corresponding array is queried with the name of the tag as the key. If an entry exists under that key, its value is retrieved and appended to the 'html' property of the class for later use, when we actually want to display the document contents. The characterData method simply appends the text between the tags to the html property of the class. The commented-out method called PIHandler is a callback that I have not yet implemented; it would process PHP script embedded directly in the XML document, if there were any.
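The lookup idea can be tried on its own, outside the class. The sketch below assumes PHP 7 or later; renderWithMaps() and the simple HTML in the two arrays are hypothetical stand-ins for the class methods and its default settings, chosen only to keep the output short.

```php
<?php
// Converts XML to HTML by looking each tag name up in two tables:
// $open for opening tags and $close for closing tags.
function renderWithMaps(string $xml, array $open, array $close): string {
    $html = '';
    $parser = xml_parser_create();

    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$html, $open) {
            $html .= $open[$name] ?? '';   // tags without an entry add nothing
        },
        function ($parser, $name) use (&$html, $close) {
            $html .= $close[$name] ?? '';
        }
    );
    xml_set_character_data_handler($parser, function ($parser, $data) use (&$html) {
        $html .= $data;                    // text passes through unchanged
    });

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $html;
}

$html = renderWithMaps(
    '<story><slug>NewsBoy is up!</slug><text>First release.<newline></newline></text></story>',
    ['STORY' => '<div>', 'SLUG' => '<h3>', 'TEXT' => '<p>'],
    ['STORY' => '</div>', 'SLUG' => '</h3>', 'TEXT' => '</p>', 'NEWLINE' => '<br>']
);

print $html . "\n";
```

Changing the look of the news page then comes down to editing the two arrays; the parsing code stays the same.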
Now, let me explain the main parser method called, you guessed it, parse()!
The first line calls the function xml_parser_create(), which returns an instance of the expat XML parser, and stores it in the class property called $this->xml_parser.
Next, to register a callback function that is a method of a class we need to use the function xml_set_object().
I use it in this way: xml_set_object($this->xml_parser, $this). The first argument is the class property where the XML parser is stored, and the second is a reference to the instantiated PHP object. This lets the parser know that all the callback functions to be registered are actually methods of that object. It is like 'pass by reference' in C or C++, which some simply call 'referencing a variable'.
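As far as I know, recent PHP releases (8.4 onwards) discourage xml_set_object(), and the handler functions also accept a callable array that names the object and the method together. The sketch below shows that alternative under that assumption; TagCounter and its methods are hypothetical and exist only for this example.

```php
<?php
// Counts how often each element appears in a document.
class TagCounter {
    public $counts = [];

    public function startElement($parser, $name, $attrs) {
        $this->counts[$name] = ($this->counts[$name] ?? 0) + 1;
    }

    public function endElement($parser, $name) {
        // nothing to do on a closing tag
    }

    public function countTags(string $xml): array {
        $parser = xml_parser_create();
        // Each callable array pairs the object with a method name,
        // so no separate xml_set_object() call is needed.
        xml_set_element_handler(
            $parser,
            [$this, 'startElement'],
            [$this, 'endElement']
        );
        xml_parse($parser, $xml, true);
        return $this->counts;
    }
}

$counter = new TagCounter();
$counts = $counter->countTags('<NewsBoy><story/><story/></NewsBoy>');
print_r($counts);
```

Either way the effect is the same: the parser ends up calling methods on one particular object, so the callbacks can reach that object's properties.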
Next in parse(), I set an XML parser option to use case folding by calling xml_parser_set_option(). With case folding on, the parser converts every tag name to upper case before passing it to my callbacks, so I do not have to worry about how a tag was written when I look it up. You could turn it off if you wanted to tell apart two tags that differ only in case, e.g. <story> and <STORY>. Using the function xml_set_element_handler() I specify the callback functions that I will be using for start and end tags, namely "startElement" and "endElement".
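A quick way to see what case folding does is to parse the same document with the option on and off. This is a small sketch; elementNames() is a hypothetical helper that just collects the names handed to the start-tag callback.

```php
<?php
// Returns the element names exactly as the parser reports them.
function elementNames(string $xml, bool $caseFolding): array {
    $names = [];
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $caseFolding);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$names) {
            $names[] = $name;
        },
        function ($parser, $name) {
            // closing tags are ignored here
        }
    );
    xml_parse($parser, $xml, true);
    return $names;
}

$xml = '<NewsBoy><story/><STORY/></NewsBoy>';

$folded    = elementNames($xml, true);   // names are upper-cased
$asWritten = elementNames($xml, false);  // names keep their original case

print implode(', ', $folded) . "\n";
print implode(', ', $asWritten) . "\n";
```

With folding on, both story elements arrive under the same name, which is why the keys in the class arrays are written in upper case.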
Following that I use xml_set_character_data_handler() to register the character data callback, characterData(). The commented-out call to xml_set_processing_instruction_handler() is the one I would use to register PIHandler() to deal with PHP code that may be included in the XML document. The remaining code simply reads the XML file in 4096-byte chunks and parses it. If an error occurs, the script stops and reports the error message along with the line number on which it occurred.
How do you use this class in a PHP script? It's very simple actually. Here is an example:
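The chunk-by-chunk pattern can be tried without a file. In this sketch, parseInChunks() is a hypothetical helper that feeds a string to the parser a few bytes at a time and returns the error text instead of calling die(); the two sample documents are made up.

```php
<?php
// Feeds $xml to the parser $chunkSize bytes at a time.
// Returns null on success, or "XML error: ... at line N" on the first failure.
function parseInChunks(string $xml, int $chunkSize): ?string {
    $parser = xml_parser_create();
    $length = strlen($xml);

    for ($offset = 0; $offset < $length; $offset += $chunkSize) {
        $chunk = substr($xml, $offset, $chunkSize);
        // The last argument must be true for the final chunk only.
        $isFinal = ($offset + $chunkSize >= $length);

        if (!xml_parse($parser, $chunk, $isFinal)) {
            return sprintf(
                "XML error: %s at line %d",
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            );
        }
    }
    return null;
}

$good = "<NewsBoy>\n<story>ok</story>\n</NewsBoy>";
$bad  = "<NewsBoy>\n<story>oops</slug>\n</NewsBoy>";

var_dump(parseInChunks($good, 8));
var_dump(parseInChunks($bad, 8));
```

Because the parser keeps its state between calls, a tag may be split across two chunks without causing trouble.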
First the class definition needs to be included in the script:
require (CLASS_DIR."class.Newsboy.php");
Then, we create an instance of the class and set the file property to the actual location of our XML document:
$news = new newsboy();
$news->xml_file = "xml/mynews.xml";
Or:
$news->xml_file = "http://xmldocs.mysite.com/mynews.xml";
Then we call the parse method to parse the document:
$news->parse();
Then we print out the html to the browser:
print ($news->html);
And, lastly when done with the object we must destroy it:
$news->destroy();
And that's really all there is to it!

Summary

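In this brief article we have covered how PHP's expat-based XML functions parse a document through callbacks, what the NewsBoy XML document looks like, and how a small PHP class turns that document into HTML.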
I hope that this introduction helps you in getting started and exploring the power of XML with PHP.
Some links that you may find helpful are listed below: