By Joe Stump
Recently at work I was given the job of learning XML. OK, so it wasn't technically XML, it was RDF, but I found that PHP's XML parsing functions worked just the same. At work I parsed out DMOZ (http://www.dmoz.org), but for simplicity I will stick with the basics of XML and leave you free to parse DMOZ in your spare time ;o)
To begin with, make sure that you have a PHP binary compiled with the '--with-xml' option enabled. Once that is done you are ready to start parsing XML. Next, grab Slashdot's XML file from their site (www.slashdot.org/slashdot.xml). Slashdot has a fairly simple file that is extremely easy to parse.
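If you are not sure how your PHP binary was built, a quick check along these lines should tell you whether the XML functions are there. This is only a sketch; $has_xml is a name made up for the example.

```php
<?php

// Check that the XML extension is loaded and that the parser
// constructor used in this article actually exists.
$has_xml = extension_loaded('xml') && function_exists('xml_parser_create');

echo $has_xml
    ? "XML parser functions are available\n"
    : "XML parser functions are missing - rebuild or enable the xml extension\n";
```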
Remember that working with XML is a lot like working with a table in a database. You have a result index in the XML parser and a pseudo table in the XML document. Once you get over the differences you will be parsing in no time.
PHP's XML functions allow you to specify three functions that will handle the data in the XML file. One handles opening tags, one handles the data between tags, and the third handles closing tags. Based on the name of the tag it gets passed, each function can then manipulate the data however you please. To begin with, look at your XML document and find out what tags are in the file. In our Slashdot file we have STORY, TITLE, URL, TIME, AUTHOR, DEPARTMENT, TOPIC, COMMENTS, SECTION, and IMAGE. In some cases you will also have attributes; for example, HREF is an attribute of A in HTML. PHP has an extremely cool way of handling attributes automagically. Next we need to define those tags in our script.

<?php

$open_tags = array(
    'STORY' => '<STORY>',
    'TITLE' => '<TITLE>',
    'URL'   => '<URL>');

$close_tags = array(
    'STORY' => '</STORY>',
    'TITLE' => '</TITLE>',
    'URL'   => '</URL>');

?>
I only want to parse out the above data because I just want to make one of those cool Slashboxes. Next on our list is to write the functions that will extract this data. Here are the functions that I created to do so.

<?php

// handles opening tags and their attributes
// $attrs is an associative array keyed by attribute
// name and having the value of that attribute
function startElement($parser, $name, $attrs = array()){
    global $open_tags, $temp, $current_tag;
    $current_tag = $name;
    // only react to the tags we listed in $open_tags
    if (isset($open_tags[$name])){
        switch($name){
            case 'STORY':
                echo 'New Story: ';
                break;
            default:
                break;
        }
    }
}

// $current_tag lets us know what tag we are currently
// dealing with - we use that later in the characterData
// function.
//
// when we see a </STORY> we know that it is time to
// flush our temp variables and prepare to move onto
// the next one
function endElement($parser, $name){
    global $close_tags, $temp, $current_tag;
    if (isset($close_tags[$name])){
        switch($name){
            case 'STORY':
                return_page($temp);
                // reset to an empty array for the next story
                $temp = array();
                break;
            default:
                break;
        }
    }
}

// this function is passed data between elements
// thus $data would equal 'Title Here'
// in the line <TITLE>Title Here</TITLE>
function characterData($parser, $data){
    global $current_tag, $temp;
    switch($current_tag){
        case 'TITLE':
            $temp['title'] = $data;
            $current_tag = '';
            break;
        case 'URL':
            $temp['url'] = $data;
            $current_tag = '';
            break;
        default:
            break;
    }
}

?>
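One thing to watch for: the parser is allowed to hand the text of a single element to the character data handler in several pieces (an entity such as &amp; is a common trigger), so a handler that keeps only the first piece can end up with a truncated title. Here is a minimal sketch of one way around it, assuming you collect the text in a buffer and store it when the closing tag arrives. The names $buffer and $titles and the sample document are made up for illustration; this is not the code used in the rest of the article.

```php
<?php

// Hypothetical input: the &amp; entity may make the text arrive in pieces.
$xml = '<STORY><TITLE>Cats &amp; Dogs</TITLE></STORY>';

$buffer = '';        // text collected since the last <TITLE>
$titles = array();   // finished titles

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, true);

xml_set_element_handler(
    $parser,
    // opening tag: start a fresh buffer for each TITLE
    function ($parser, $name, $attrs) use (&$buffer) {
        if ($name === 'TITLE') {
            $buffer = '';
        }
    },
    // closing tag: the buffer now holds the complete title
    function ($parser, $name) use (&$buffer, &$titles) {
        if ($name === 'TITLE') {
            $titles[] = $buffer;
        }
    }
);

// append every piece instead of overwriting with the first one
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$buffer) {
        $buffer .= $data;
    }
);

if (!xml_parse($parser, $xml, true)) {
    echo 'XML error: ' . xml_error_string(xml_get_error_code($parser)) . "\n";
}

echo implode("\n", $titles), "\n";
```

The same idea would carry over to the handlers above by appending with .= in characterData and leaving $current_tag alone until the closing tag is seen.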
As you can see so far XML parsing in PHP isn't all that bad. Now for the fun part - parsing the file! For that you will need the rest of the code, which is fairly simple.

<?php

// prints one story as a link - called from endElement
function return_page($temp){
    echo 'o <A HREF="'.$temp['url'].'">'.$temp['title'].'</A><BR>';
}

// what are we parsing?
$xml_file = 'slashdot.xml';

// declare the character set - UTF-8 is the default
$type = 'UTF-8';

// create our parser
$xml_parser = xml_parser_create($type);

// set some parser options
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

// this tells PHP what functions to call when it finds an element
// these functions also handle the element's attributes
xml_set_element_handler($xml_parser, 'startElement', 'endElement');

// this tells PHP what function to use on the character data
xml_set_character_data_handler($xml_parser, 'characterData');

if (!($fp = fopen($xml_file, 'r'))) {
    die("Could not open $xml_file for parsing!\n");
}

// loop through the file and parse baby!
while ($data = fread($fp, 4096)) {
    // utf8_encode() assumes the input is ISO-8859-1;
    // skip this step if your file is already UTF-8
    if (!($data = utf8_encode($data))) {
        echo 'ERROR'."\n";
    }
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d\n\n",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}

fclose($fp);
xml_parser_free($xml_parser);

?>
Now this is what happens: PHP parses along until it finds <ELEMENT ATTRIBUTE='bold'>, then passes ELEMENT and its attributes to the startElement function. Since Slashdot's file doesn't have any attributes we don't worry about them, but that is where they will be if a file does have them. Next it passes the data between the opening and closing tags to characterData, and finally it passes the name of the closing tag to the endElement function (closing tags carry no attributes). The endElement function is what calls return_page(), but only when it sees that we have hit the end of a story. Up until that point our variable $temp holds the data we have been collecting in characterData.
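To see that order for yourself, here is a small sketch that parses a one-line document from a string and records each handler call. The document, the $events array, and the closures are assumptions made for this example, not part of the Slashdot script.

```php
<?php

// Hypothetical one-element document; ELEMENT and ATTRIBUTE are placeholder names.
$xml = '<ELEMENT ATTRIBUTE="bold">some text</ELEMENT>';

$events = array();

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, true);

xml_set_element_handler(
    $parser,
    // opening tag: receives the tag name and an array of attributes
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start $name";
        foreach ($attrs as $key => $value) {
            $events[] = "attr $key=$value";
        }
    },
    // closing tag: receives only the tag name
    function ($parser, $name) use (&$events) {
        $events[] = "end $name";
    }
);

// text between the tags (may arrive in more than one call)
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$events) {
        $events[] = "data $data";
    }
);

if (!xml_parse($parser, $xml, true)) {
    echo 'XML error: ' . xml_error_string(xml_get_error_code($parser)) . "\n";
}

echo implode("\n", $events), "\n";
```

Running it should list the start tag first, then its attribute, then the text, and the end tag last.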
Now all that is left is to put a wget in your cron!
-- Joe