I'm sure everyone who calls themselves a PHP coder has heard of RSS. I have to include the disclaimer, as I'm always surprised how little penetration concepts I take for granted have in the rest of the world. Recently, a colleague in the music industry started a blog. I saw there were no RSS feeds available, and asked him when they'd be ready. He had no clue what I was talking about. I shouldn't really have been surprised. Nevertheless, RSS feeds are starting to take the world by storm.
Recently I was looking for an RSS aggregator. I was having surprising difficulty finding one that did exactly what I wanted. Being quite impatient, especially when I'd spent more time looking than it would have taken me to write one had I started immediately, I began seriously considering writing my own. This month, I show you how to create a basic RSS reader yourself.
What is RSS?
I'm not interested in getting into the debate as to whether RSS stands for Really Simple Syndication, Rich Site Summary, RDF Site Summary, or anything else, nor into which of the RSS versions is 'better'. This article is not going to investigate the differences between the various RSS versions either. Since we're going to be using PHP to read RSS, what we do need to know are some of the basic technical details of what an RSS feed comprises.
RSS is most simply understood as an application of XML. It contains (roughly, as versions differ!) the following elements (full specifications can be found at http://web.resource.org/rss/1.0/spec and http://blogs.law.harvard.edu/tech/rss):

    * <title>, the feed title, or the name of the channel.
    * <description>, a short piece of text describing the feed.
    * <link>, the URL of the corresponding web page.
    * <language>, the language used for the content (for more details see RFC1766, Tags for the Identification of Languages).
    * <lastBuildDate>, the date and time the feed was updated.
    * <pubDate>, the date and time the content was published.
    * <copyright>, any copyright information.
    * <generator>, the software used to generate the RSS.
    * <docs>, a URL pointing to documentation for the format used in the RSS file (usually one of the specification links above).
    * <ttl>, time to live, or how long the feed can be cached, in minutes.
    * <image> An image related to the RSS feed. This in turn contains up to 6 sub-elements, the first three being mandatory:
          o <title> Same function as the channel's title, and usually contains the same string.
          o <url> Image URL.
          o <link> Same function as the channel's link, and usually contains the same string.
          o <width> In pixels, maximum of 144, default of 88.
          o <height> In pixels, maximum of 400, default of 31.
          o <description> Text used for the HTML title attribute of the link around the image.
    * <rating> The PICS rating, originally designed for access control for parents and teachers. See the W3C specification.
    * <cloud> Specifies a cloud web service which allows notification of updates.
    * <textInput> Used to specify a text input box, which isn't usually that useful, and which we don't look at here.
    * <skipHours> This tells well-behaved aggregators when not to read the feed, reducing unnecessary reads when nothing has changed. It consists of up to 24 <hour> elements, each containing a number from 0 to 23, representing an hour to be skipped.
    * <skipDays> Similar to <skipHours>, this consists of <day> sub-elements, each listing a day to be skipped (written in full, such as Saturday).
    * <webMaster> If you enjoy harvesting spam, you may also want to put the email address of the person responsible for technical issues.
    * <managingEditor> Another spam harvester, this time the email of the editor responsible for the content.
    * <item> Most importantly, it also contains a number of items, which are usually the articles, stories or posts you're interested in reading. It contains a number of sub-elements, all of which are optional, although one of either title or description must be present.
          o <title> Usually the article headline.
          o <description> Usually a short blurb, although it can contain the entire contents.
          o <link> URL to the item.
          o <author> More potential spam gathering, this one is for the email address of the item's author
          o <category> String identifying the category
          o <comments> URL for item comments
          o <enclosure> Used for attached video or audio files. It has 3 attributes - url, length (in bytes) and type.
          o <guid> A unique identifier (Globally Unique Identifier). Usually a URL. Has an attribute, isPermaLink, which tells the reader whether the guid is a URL to the item (it defaults to true).
          o <pubDate> Date and time when the item was published
          o <source> Text string of the source, it also has a required attribute, url, which links to the source's XMLization

After all that, let's look at a sample, loosely based on that of the PHPBuilder website. For simplicity I've included just two items. It's in RSS 0.91 format.

<?xml version="1.0"?>
<rss version="0.91">
<channel>
  <pubDate>Thu, 29 Sep 2006 15:16:13 GMT</pubDate>
  <description>Newest Articles and How-To's on PHPBuilder.com</description>
  <link>http://phpbuilder.com</link>
  <title>PHPBuilder.com New Articles</title>
  <webMaster>staff@phpbuilder.com</webMaster>
  <language>en-us</language>
  <item>
    <title>In Case You Missed It...The Week of September 26, 2006</title>
    <link>http://www.phpbuilder.com/columns/weeklyroundup20060926.php3</link>
    <description>This week Elizabeth brings us news of an upcoming free
webcast called Design Patterns in PHP, the schedule for the Fall Zend conference,
security alerts for Moveable Type and phpBB, the release of Zend Platform 2,
XAMPP for Linux, the latest PEAR/PECL releases and much more!
</description>
  </item>
  <item>
    <title>In Case You Missed It...The Week of September 19, 2006</title>
    <link>http://www.phpbuilder.com/columns/weeklyroundup20060919.php3</link>
    <description>This week Elizabeth brings us news of the release of PEAR 1.4,
Zend Studio 5 Beta, a security vulnerability with PHP-Nuke, the release of a
SimpleTest plugin for PHPEclipse, a patch for phpMyAdmin, the latest PEAR/PECL
releases and much, much more!</description>
  </item>
</channel>
</rss>

Using PHP to parse RSS
Save the above XML as a file called phpbuilder.rss, as we're going to use it in the following examples. You can of course use any existing RSS feed out there, but for demonstration purposes it'll be easier if your example is exactly the same as the one I'm using. Let's look at the built-in PHP functions we'll be using.

    * xml_parser_create()
    * xml_set_element_handler($xml_resource,$start_element_function,$end_element_function)
    * xml_set_character_data_handler($xml_resource,$character_data_handler)
    * xml_parse($xml_resource,$data)

The first, xml_parser_create(), creates an XML parser and returns a resource handle, which the other functions use. Since an XML feed is made up of a number of elements, and these could differ from feed to feed, we need to apply logic as we traverse the feed. Elements can contain multiple sub-elements and attributes. To assist with this, xml_set_element_handler() defines the functions to be called depending on whether an element has opened or closed. It takes three arguments: the first is the XML resource handle returned by xml_parser_create(), the second is the name of a function called automatically when the start of an element is reached during parsing, and the third is the name of the function called automatically when the end of an element is reached. (The latter two arguments can also be arrays containing an object and a method name, which we don't look at here.)
Calling a function when we reach the start and end of an element is all very well, but we also need to perform some logic on the characters in between. xml_set_character_data_handler() is the function that sets this up. The first of its two arguments is of course the XML resource handle, and the second is the name of the function called whenever character data is read.
xml_parse() is the function to call to actually start parsing the feed. It takes the XML resource handle as the first argument, and a string containing a portion of the feed as the second. An optional third argument can assist the logic by indicating whether the string is the last piece of data in this parse, but we don't look at that this time either. Let's start with a simple skeleton to see how this all works:

<?php
$rssFeeds = array ('phpbuilder.rss');
// for now we'll just have the one file, but this can later be expanded

// Loop through the array (just one element for now) and read the feed
foreach ($rssFeeds as $feed) {
  readFeeds($feed);
}

// The function to be called when a start element is read. For now we'll
// just echo some output
function startElement($xp,$name,$attributes) {
  echo "Start $name <br>";
}

// The function to be called when an end element is read
function endElement($xp,$name) {
  echo "End: $name<br>";
}

function readFeeds($feed) {
  $fh = fopen($feed,'r'); // open file for reading
  $xp = xml_parser_create(); // Create an XML parser resource
  // defines which functions to call when element started/ended
  xml_set_element_handler($xp, "startElement", "endElement");
  while ($data = fread($fh, 4096)) {
    if (!xml_parse($xp,$data)) {
      return 'Error in the feed';
    }
  }
}
?>

If you run this script, it'll output the following:

Start RSS
Start CHANNEL
Start PUBDATE
End: PUBDATE
Start DESCRIPTION
End: DESCRIPTION
Start LINK
End: LINK
Start TITLE
End: TITLE
Start WEBMASTER
End: WEBMASTER
Start LANGUAGE
End: LANGUAGE
Start ITEM
Start TITLE
End: TITLE
Start LINK
End: LINK
Start DESCRIPTION
End: DESCRIPTION
End: ITEM
Start ITEM
Start TITLE
End: TITLE
Start LINK
End: LINK
Start DESCRIPTION
End: DESCRIPTION
End: ITEM
End: CHANNEL
End: RSS

Note the order in which startElement() and endElement() are called. It's important that you understand this, which is why I've started with this skeleton, and not particularly useful, piece of code. The mechanism may be tricky to understand at first, but it is quite fundamental to the way XML is parsed.
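To make the nesting easier to see, here is a minimal sketch that indents each element name by how deeply it is nested. The miniature feed and the $depth and $lines variables are my own illustrative assumptions, not part of the script above, and the handlers are written as anonymous functions (PHP 5.3 or later) purely to keep the demo self-contained.

```php
<?php
// Hypothetical miniature feed, much smaller than phpbuilder.rss
$xml = '<rss><channel><title>Demo</title><item><title>One</title></item></channel></rss>';

$depth = 0;       // number of elements currently open
$lines = array(); // one indented line per start tag

$xp = xml_parser_create();
xml_set_element_handler(
    $xp,
    // Start handler: record the name, then step one level deeper
    function ($xp, $name, $attributes) use (&$depth, &$lines) {
        $lines[] = str_repeat('  ', $depth) . $name;
        $depth++;
    },
    // End handler: the innermost open element has closed, so step back out
    function ($xp, $name) use (&$depth) {
        $depth--;
    }
);
xml_parse($xp, $xml, true); // true = this string is the complete document

echo implode("\n", $lines), "\n";
```

Every start call is eventually matched by an end call for the same element, so $depth is back at zero once parsing finishes, and the names come out indented according to their nesting: RSS, then CHANNEL one level in, then TITLE and ITEM two levels in, and the item's own TITLE three levels in.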
Now let's add the character data handler function. This is the one that actually does something with the data between the open and close tags. The additions are the characterDataHandler() function and the call to xml_set_character_data_handler():

<?php

$rssFeeds = array ('phpbuilder.rss');

//Loop through the array, reading the feeds one by one
foreach ($rssFeeds as $feed) {
  readFeeds($feed);
}
function startElement($xp,$name,$attributes) {
  echo "Start $name <br>";
}

function endElement($xp,$name) {
  echo "End: $name<br>";
}

function characterDataHandler($xp,$data) {
  echo "Data: $data<br>";
}

function readFeeds($feed) {
  $fh = fopen($feed,'r');
// open file for reading

  $xp = xml_parser_create();
// Create an XML parser resource

  xml_set_element_handler($xp, "startElement", "endElement");
// defines which functions to call when element started/ended

  xml_set_character_data_handler($xp, "characterDataHandler");

  while ($data = fread($fh, 4096)) {
    if (!xml_parse($xp,$data)) {
      return 'Error in the feed';
    }
  }
}
?>

Running this version outputs the following:

Start RSS
Data:
Data:
Start CHANNEL
Data:
Data:
Start PUBDATE
Data: Thu, 29 Sep 2006 15:16:13 GMT
End: PUBDATE
Data:
Data:
Start DESCRIPTION
Data: Newest Articles and How-To's on PHPBuilder.com
End: DESCRIPTION
Data:
Data:
Start LINK
Data: http://phpbuilder.com
End: LINK
Data:
Data:
Start TITLE
Data: PHPBuilder.com New Articles
End: TITLE
Data:
Data:
Start WEBMASTER
Data: staff@phpbuilder.com
End: WEBMASTER
Data:
Data:
Data:
Start LANGUAGE
Data: en-us
End: LANGUAGE
Data:
Data:
Start ITEM
Data:
Data:
Start TITLE
Data: In Case You Missed It...The Week of September 26, 2006
End: TITLE
Data:
Data:
Start LINK
Data: http://www.phpbuilder.com/columns/weeklyroundup20060926.php3
End: LINK
Data:
Data:
Start DESCRIPTION
Data: In Case You Missed It...The Week of September 26, 2006
End: DESCRIPTION
Data:
Data:
End: ITEM
Data:
Data:
Start ITEM
Data:
Data:
Start TITLE
Data: In Case You Missed It...The Week of September 19, 2006
End: TITLE
Data:
Data:
Start LINK
Data: http://www.phpbuilder.com/columns/weeklyroundup20060919.php3
End: LINK
Data:
Data:
Start DESCRIPTION
Data: In Case You Missed It...The Week of September 19, 2006
End: DESCRIPTION
Data:
Data:
End: ITEM
Data:
Data:
End: CHANNEL
Data:
End: RSS
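If you want to see exactly what the parser hands to the character data handler, including chunks that render as nothing in a browser, a small sketch like the following may help. The two-line fragment and the $chunks array are assumptions of mine for illustration. How the parser splits text into chunks can vary, so the sketch only reports what it receives.

```php
<?php
// Hypothetical fragment; the line breaks and spaces between tags are deliberate
$xml = "<channel>\n  <title>Demo feed</title>\n</channel>";

$chunks = array(); // every string passed to the character data handler

$xp = xml_parser_create();
xml_set_character_data_handler($xp, function ($xp, $data) use (&$chunks) {
    $chunks[] = $data;
});
xml_parse($xp, $xml, true); // true = this string is the complete document

foreach ($chunks as $chunk) {
    // json_encode() quotes the string and escapes line breaks, making blanks visible
    $kind = (trim($chunk) === '') ? 'whitespace only' : 'text';
    echo $kind, ': ', json_encode($chunk), "\n";
}
```

A handler that only cares about real text could simply return early when trim($data) is an empty string.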

Note that the empty Data: rows come from the whitespace (spaces and line breaks) between the tags in the phpbuilder.rss file. Now that we have a good idea how the mechanism works, let's do some useful parsing. We're going to keep things easy to follow, if not particularly elegant, and use global variables to keep track of what's happening. Rewrite the three functions as follows:

function startElement($xp,$name,$attributes) {
  global $item,$currentElement;
  // the other functions will always know which element we're parsing
  $currentElement = $name;
  // by default PHP converts everything to uppercase
  if ($currentElement == 'ITEM') {
    // We're only interested in the contents of the item element.
    // This flag keeps track of where we are
    $item = true;
  }
}

function endElement($xp,$name) {
  global $item,$currentElement,$title,$description,$link;
  // If we're at the end of the item element, display
  // the data, and reset the globals
  if ($name == 'ITEM') {
    echo "<b>Title:</b> $title<br>";
    echo "<b>Description:</b> $description<br>";
    echo "<b>Link:</b> $link<br><br>";
    $title = '';
    $description = '';
    $link = '';
    $item = false;
  }
}

function characterDataHandler($xp,$data) {
  global $item,$currentElement,$title,$description,$link;
  // Only add to the globals if we're inside an item element.
  if ($item) {
    switch($currentElement) {
      case "TITLE":
        // We use .= because this function may be called multiple
        // times for one element.
        $title .= $data;
        break;
      case "DESCRIPTION":
        $description .= $data;
        break;
      case "LINK":
        $link .= $data;
        break;
    }
  }
}

Here's what the above changes attempt to do. We need to know which particular element we're working with at any one time. Inside startElement(), we use a global variable, $currentElement, which is set every time startElement() is called. It is assigned a string containing the name of the current element. By default PHP's XML parser applies what it calls case folding, which means that element names are automatically converted to uppercase. Then, in characterDataHandler(), we check what this variable is set to (using the switch statement), and append the data, or the contents of the tag, to an appropriately named variable (one of $title, $description or $link). These are also global, as they are used for display purposes in endElement(). For now we'll only worry about these three elements - you can easily extend this to include other, optional, elements later. Here's what the script incorporating the above changes outputs:

Title: In Case You Missed It...The Week of September 26, 2006
Description: This week Elizabeth brings us news of an upcoming free
webcast called Design Patterns in PHP, the schedule for the Fall Zend conference,
security alerts for Moveable Type and phpBB, the release of Zend Platform 2,
XAMPP for Linux, the latest PEAR/PECL releases and much more!
Link: http://www.phpbuilder.com/columns/weeklyroundup20060926.php3

Title: In Case You Missed It...The Week of September 19, 2006
Description: This week Elizabeth brings us news of the release of PEAR
1.4, Zend Studio 5 Beta, a security vulnerability with PHP-Nuke, the release
of a SimpleTest plugin for PHPEclipse, a patch for phpMyAdmin, the latest
PEAR/PECL releases and much, much more!
Link: http://www.phpbuilder.com/columns/weeklyroundup20060919.php3
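
For readers who would rather avoid the globals, the same logic could be sketched with an object holding the state, using the array form of the handler arguments mentioned earlier. This is only one possible arrangement: the ItemCollector class, the parseItems() function and the inline demo feed are hypothetical names of my own, and the sketch collects the items into an array instead of echoing them. It also passes true as the third argument to xml_parse() and reports parse errors with PHP's xml_error_string(), xml_get_error_code() and xml_get_current_line_number() functions.

```php
<?php
// Hypothetical class: holds the state that the script above keeps in globals
class ItemCollector {
    public $items = array();    // finished items
    private $inItem = false;    // true while between <item> and </item>
    private $current = '';      // name of the element being read
    private $fields = array();  // fields of the item in progress

    public function startElement($xp, $name, $attributes) {
        $this->current = $name;
        if ($name == 'ITEM') {  // names are uppercase because of case folding
            $this->inItem = true;
            $this->fields = array('TITLE' => '', 'DESCRIPTION' => '', 'LINK' => '');
        }
    }

    public function endElement($xp, $name) {
        if ($name == 'ITEM') {
            // Strip the whitespace collected around each field
            $this->items[] = array_map('trim', $this->fields);
            $this->inItem = false;
        }
        $this->current = '';    // text between tags belongs to no field
    }

    public function characterData($xp, $data) {
        // May be called several times for one element, hence .=
        if ($this->inItem && isset($this->fields[$this->current])) {
            $this->fields[$this->current] .= $data;
        }
    }
}

// Hypothetical helper: parses a complete feed held in a string
function parseItems($xml) {
    $collector = new ItemCollector();
    $xp = xml_parser_create();
    xml_set_element_handler($xp, array($collector, 'startElement'), array($collector, 'endElement'));
    xml_set_character_data_handler($xp, array($collector, 'characterData'));
    if (!xml_parse($xp, $xml, true)) {  // true = this is the last piece of data
        throw new Exception(sprintf('XML error: %s at line %d',
            xml_error_string(xml_get_error_code($xp)),
            xml_get_current_line_number($xp)));
    }
    return $collector->items;
}

// Made-up demo feed with a single item
$feed = '<rss version="0.91"><channel><title>Demo feed</title>'
      . '<item><title>First post</title><link>http://example.com/1</link>'
      . '<description>Hello</description></item></channel></rss>';

foreach (parseItems($feed) as $item) {
    echo $item['TITLE'], ' | ', $item['LINK'], "\n";
}
```

With the demo feed this prints First post | http://example.com/1. The channel's own title is ignored, because it appears outside any item element.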

Conclusion
It's been more convoluted than you may perhaps have expected (compare this to reading and parsing a file!) but you should now be able to successfully make some use of a basic RSS feed. In part 2, next week, we look at combining multiple feeds, and making use of some of the other elements. Until then!