XML How-To
Joe Stump
Recently at work I was given the job of learning XML. OK, so it wasn't technically
XML, it was RDF, but I found that PHP's XML parsing functions worked just the same.
At work I parsed out DMOZ (http://www.dmoz.org), but for simplicity I will stick with
the basics of XML and leave you free to parse DMOZ in your spare time ;o)
To begin with, you will want to make sure that you have a PHP binary compiled with
the '--with-xml' option enabled. Once that is done you are ready to start parsing
XML. Next, grab Slashdot's XML file from their site (www.slashdot.org/slashdot.xml).
Slashdot has a fairly simple file that is extremely easy to parse.
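If you are not sure how your PHP binary was built, a quick check along these lines can tell you whether the XML functions are there. This is only a sketch: the helper name xml_support_available() is made up for this example. On current PHP releases the XML extension is normally enabled by default, so the check mostly matters on older or stripped-down builds.

```php
<?php
// Hypothetical helper (not part of PHP): true when the XML parser
// extension is loaded and its entry point exists.
function xml_support_available()
{
    return extension_loaded('xml') && function_exists('xml_parser_create');
}

if (xml_support_available()) {
    echo "XML support is compiled in.\n";
} else {
    echo "XML support is missing; rebuild or install PHP with the XML extension.\n";
}
```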
Remember that when you are working with XML it is a lot like working with a table in
a database. You have a result index in the XML parser and a pseudo table in the XML
document. Once you get over the differences you will be parsing in no time.
PHP's XML functions allow you to specify three functions that will handle the data in
the XML file. One handles opening tags, one handles the data between tags, and the
third handles the ending tags. Based on the name of the tag each one gets passed, you can
then manipulate the data however you please. To begin with, you need to look at your
XML document and find out what tags are in the file. In our Slashdot file we have
STORY, TITLE, URL, TIME, AUTHOR, DEPARTMENT, TOPIC, COMMENTS, SECTION, and IMAGE.
In some cases you would have attributes; for example, HREF is an attribute of A in
HTML. PHP has an extremely cool way of handling attributes automagically. Next we
need to define those tags in our script.
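As a rough sketch of what defining the tags could look like, the snippet below keeps the tag names in an array and then shows how attributes arrive. The variable names $tags and $seen are made up for this example, and the one-line document fed to the parser is invented. The names are upper case because PHP's parser folds element and attribute names to upper case by default.

```php
<?php
// Tags found in the Slashdot file, upper-cased to match what the parser
// reports when case folding is on (PHP's default).
$tags = array('STORY', 'TITLE', 'URL', 'TIME', 'AUTHOR', 'DEPARTMENT',
              'TOPIC', 'COMMENTS', 'SECTION', 'IMAGE');

// Attributes are handed to the start handler as an associative array.
$seen = array();
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$seen) {
        $seen[$name] = $attrs;   // remember the attributes of each tag
    },
    function ($parser, $name) {
        // nothing to do on closing tags in this demo
    }
);
xml_parse($parser, '<a href="http://www.dmoz.org">DMOZ</a>', true);

print_r($seen);
```

Here the start handler is called with the name A and an array whose HREF key holds the URL, so no extra work is needed to pick attributes apart.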
I only want to parse out the above data because I just want to make one of those
cool Slashboxes. Next on our list is to make the functions that will extract this
data. On the following page are the functions that I created to do so.
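The original listing is not reproduced here, so what follows is only a minimal sketch of three handlers in the spirit of the ones described. The class name SlashboxHandlers, the sample document, and the HTML that return_page() builds are all assumptions made for this example; the method names startElement, characterData, endElement and return_page, and the $temp holder, follow the names used in this article.

```php
<?php
// Hypothetical wrapper class so the handlers can share state without globals.
class SlashboxHandlers
{
    // Tags whose text we keep (upper case: the parser folds names by default).
    public $wanted = array('TITLE', 'URL', 'TIME', 'AUTHOR', 'DEPARTMENT',
                           'TOPIC', 'COMMENTS', 'SECTION', 'IMAGE');
    public $current = '';      // wanted tag we are currently inside, or ''
    public $temp = array();    // fields collected for the story being read
    public $pages = array();   // finished output, one entry per story

    public function startElement($parser, $name, $attrs)
    {
        if ($name == 'STORY') {
            $this->temp = array();          // a new story starts: reset
        } elseif (in_array($name, $this->wanted)) {
            $this->current = $name;
            $this->temp[$name] = '';
        }
    }

    public function characterData($parser, $data)
    {
        // Text can arrive in several pieces, so append rather than assign.
        if ($this->current != '') {
            $this->temp[$this->current] .= $data;
        }
    }

    public function endElement($parser, $name)
    {
        $this->current = '';
        if ($name == 'STORY') {
            $this->pages[] = $this->return_page($this->temp);
        }
    }

    // Assumed output format: one link per story.
    public function return_page($story)
    {
        $url = isset($story['URL']) ? trim($story['URL']) : '';
        $title = isset($story['TITLE']) ? trim($story['TITLE']) : '';
        return '<a href="' . htmlspecialchars($url) . '">'
             . htmlspecialchars($title) . '</a>';
    }
}

// Invented sample data standing in for the real Slashdot file.
$sample = '<feed><story><title>Example headline</title>'
        . '<url>http://example.com/1</url></story></feed>';

$handlers = new SlashboxHandlers();
$parser = xml_parser_create();
xml_set_element_handler($parser, array($handlers, 'startElement'),
                        array($handlers, 'endElement'));
xml_set_character_data_handler($parser, array($handlers, 'characterData'));
xml_parse($parser, $sample, true);

print_r($handlers->pages);
```

Plain functions with global variables would work just as well; the class is only there to keep the shared state in one place.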
As you can see, so far XML parsing in PHP isn't all that bad. Now for the
fun part: parsing the file! For that you will need the rest of the code, which is fairly simple.
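The remaining code is essentially a loop that feeds the file to the parser. The sketch below shows one way to write it, assuming a helper named parse_xml_file() (made up for this example) that takes the three handlers as callables and returns an empty string on success or an error message. The temporary file and its contents are invented so the example can run without the Slashdot download.

```php
<?php
// Hypothetical helper: stream a file through PHP's XML parser in chunks.
function parse_xml_file($path, $start, $end, $data)
{
    $parser = xml_parser_create();
    xml_set_element_handler($parser, $start, $end);
    xml_set_character_data_handler($parser, $data);

    $fp = fopen($path, 'r');
    if (!$fp) {
        return "could not open $path";
    }

    $error = '';
    while ($chunk = fread($fp, 4096)) {
        // The third argument tells the parser when the last chunk has arrived.
        if (!xml_parse($parser, $chunk, feof($fp))) {
            $error = sprintf('%s at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser));
            break;
        }
    }
    fclose($fp);
    return $error;
}

// Invented sample file standing in for slashdot.xml.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path,
    '<feed><story><title>One</title></story><story><title>Two</title></story></feed>');

$count = 0;
$error = parse_xml_file(
    $path,
    function ($parser, $name, $attrs) use (&$count) {
        if ($name == 'STORY') {
            $count++;            // count stories as their opening tags appear
        }
    },
    function ($parser, $name) {
    },
    function ($parser, $data) {
    }
);
unlink($path);

echo $error === '' ? "Parsed $count stories\n" : "Parse failed: $error\n";
```

Reading in chunks keeps memory use flat, which matters far more for something the size of DMOZ than for the Slashdot file.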
Now this is what happens: PHP parses along until it finds
<ELEMENT ATTRIBUTE='bold'>
It then passes ELEMENT and its attributes to the startElement function. Since Slashdot's
file doesn't have any attributes we don't worry about them, but that is where they will be if
a file does have them. Next it passes the data between the opening and closing tags
to characterData, and finally it passes the name of the closing element to the
endElement function (the end handler receives only the name, not the attributes). The endElement
function is what calls the return_page() function, but only when it sees that we have hit the
end of a story. Up until that point
our variable $temp holds the data we have been collecting in startElement and characterData.
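To watch that sequence happen, a small trace like the one below can help. It is only an illustration: the $events array and the one-element document are invented for this example, and each handler simply records what it was called with.

```php
<?php
// Record every callback in the order the parser makes it.
$events = array();

$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = array('start', $name, $attrs);
    },
    function ($parser, $name) use (&$events) {
        $events[] = array('end', $name);    // no attributes are passed here
    }
);
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$events) {
        $events[] = array('data', $data);
    }
);

xml_parse($parser, "<ELEMENT ATTRIBUTE='bold'>some text</ELEMENT>", true);

print_r($events);
```

The first entry is the start event with ELEMENT and its ATTRIBUTE array, the text comes next, and the end event for ELEMENT comes last.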
Now all that is left is to put a wget in your cron!
-- Joe