I'm sure everyone who calls themselves a PHP coder has heard of RSS. I have to include the disclaimer, as I'm always surprised by how little the concepts I take for granted have penetrated the rest of the world. Recently, a colleague in the music industry started a blog. I saw there were no RSS feeds available, and asked him when they'd be ready. He had no clue what I was talking about. I shouldn't really have been surprised. Nevertheless, RSS feeds are starting to take the world by storm.
Recently I was looking for an RSS aggregator. I was having surprising difficulty finding one that did exactly what I wanted. Being quite impatient, especially when I'd spent more time looking than it would have taken me to write one had I started immediately, I began seriously considering writing my own. This month, I show you how to create a basic RSS reader yourself.
What is RSS?
I'm not interested in getting into the debate as to whether RSS stands for Really Simple Syndication, Rich Site Summary, RDF Site Summary, or anything else, nor into which of the RSS versions is 'better'. This article is not going to investigate the differences between the various RSS versions either. If we're going to be using PHP to read RSS, we rather need to know some of the basic technical details of what an RSS feed comprises.
RSS is most simply understood as an application of XML. Full specifications can be found at http://web.resource.org/rss/1.0/spec and http://blogs.law.harvard.edu/tech/rss. A feed contains (roughly, as versions differ!) the following elements:
* <title>, the feed title, or the name of the channel.
* <description>, a short piece of text describing the feed.
* <link>, the URL of the corresponding web page.
* <language>, the language used for the content (for more details see RFC1766, Tags for the Identification of Languages).
* <lastBuildDate>, the date and time the feed was updated.
* <pubDate>, the date and time the content was published.
* <copyright>, any copyright information.
* <generator>, the software used to generate the RSS.
* <docs>, a URL pointing to documentation for the format used in the RSS file (usually one of the specification links above).
* <ttl>, time to live, or how long the feed can be cached, in minutes.
* <image> An image related to the RSS feed. This in turn contains up to 6 sub-elements, the first three being mandatory:
o <title> Same function as the channel's title, and usually contains the same string.
o <url> Image URL.
o <link> Same function as the channel's link, and usually contains the same string.
o <width> In pixels, maximum of 144, default of 88.
o <height> In pixels, maximum of 400, default of 31.
o <description> Text used for the HTML title attribute of the link formed around the image.
* <rating> The PICS rating, originally designed for access control for parents and teachers. See the W3C specification.
* <cloud> Specifies a cloud web service which allows notification of updates.
* <textInput> Used to specify a text input box, which isn't usually that useful, and which we don't look at here.
* <skipHours> This tells well-behaved aggregators when not to read the feed, reducing unnecessary reads when nothing has changed. It consists of up to 24 <hour> elements, each containing a number from 0 to 23, representing an hour to be skipped.
* <skipDays> Similar to <skipHours>, this consists of <day> sub-elements, each listing a day to be skipped (written in full, such as Saturday).
* <webMaster> If you enjoy harvesting spam, you may also want to put in the email address of the person responsible for technical issues.
* <managingEditor> Another spam harvester, this time the email of the editor responsible for the content.
* <item> Most importantly, the feed also contains a number of items, which are usually the articles, stories or posts you're interested in reading. Each item contains a number of sub-elements, all of which are optional, although at least one of title or description must be present.
o <title> Usually the article headline.
o <description> Usually a short blurb, although it can contain the entire contents.
o <link> URL to the item.
o <author> More potential spam gathering, this one is for the email address of the item's author.
o <category> String identifying the category.
o <comments> URL for item comments.
o <enclosure> Used for attached video or audio files. It has 3 attributes: url, length (in bytes) and type (the MIME type).
o <guid> A unique identifier (Globally Unique Identifier). Usually a URL. Has an attribute, isPermaLink, which tells the reader whether the guid is a URL to the item (it defaults to true).
o <pubDate> Date and time when the item was published.
o <source> Text string naming the RSS channel the item came from. It also has a required attribute, url, which links to the XML version of that source.
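To make the element list above a little more concrete, here is a minimal sketch of how a few of those elements might be pulled out with PHP's SimpleXML extension. It is an illustration rather than the reader built in this article: the feed, its URLs and the helper name summarise_feed() are all invented, and it assumes PHP 7 or later and the RSS 0.9x/2.0 layout, where the items sit inside <channel>.

```php
<?php
// Hypothetical feed, written inline so the sketch is self-contained.
// Every name and URL in it is invented for illustration.
$xml = <<<'XML'
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://www.example.com/</link>
    <description>A made-up feed for illustration</description>
    <ttl>60</ttl>
    <skipHours><hour>0</hour><hour>1</hour></skipHours>
    <item>
      <title>First post</title>
      <link>http://www.example.com/first</link>
      <guid>http://www.example.com/first</guid>
    </item>
    <item>
      <description>An item with no title, only a description</description>
      <guid isPermaLink="false">post-2</guid>
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: reduce an RSS string to a plain PHP array.
function summarise_feed(string $xml): array
{
    $rss = simplexml_load_string($xml);
    if ($rss === false || !isset($rss->channel)) {
        throw new InvalidArgumentException('Not a parsable RSS document');
    }
    $channel = $rss->channel;

    // <skipHours> is optional, so only walk it when it is present.
    $skipHours = [];
    if (isset($channel->skipHours)) {
        foreach ($channel->skipHours->hour as $hour) {
            $skipHours[] = (int) $hour;
        }
    }

    $items = [];
    foreach ($channel->item as $item) {
        $hasGuid = isset($item->guid);
        $items[] = [
            // Title and description are both optional; fall back to the description.
            'label' => isset($item->title) ? (string) $item->title : (string) $item->description,
            'link'  => isset($item->link) ? (string) $item->link : null,
            'guid'  => $hasGuid ? (string) $item->guid : null,
            // isPermaLink counts as true unless it is explicitly "false".
            'isPermaLink' => $hasGuid && (string) $item->guid['isPermaLink'] !== 'false',
        ];
    }

    return [
        'title'     => (string) $channel->title,
        'ttl'       => isset($channel->ttl) ? (int) $channel->ttl : null,
        'skipHours' => $skipHours,
        'items'     => $items,
    ];
}

// Usage: print the channel title followed by one line per item.
$feed = summarise_feed($xml);
echo $feed['title'], "\n";
foreach ($feed['items'] as $item) {
    echo ' - ', $item['label'], "\n";
}
```

Casting each SimpleXML node to a string or an int up front keeps the rest of a reader free of XML objects, and the isset() checks reflect the fact that almost every element in the list is optional.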
After all that, let's look at a sample, loosely based on that of the PHPBuilder website. For simplicity I've included just two items. It's in RSS 0.91 format.
<pubDate>Thu, 29 Sep 2006 15:16:13 GMT</pubDate>
<description>Newest Articles and How-To's on PHPBuilder.com</description>
<title>PHPBuilder.com New Articles</title>
<item>
<title>In Case You Missed It...The Week of September 26, 2006</title>
<description>This week Elizabeth brings us news of an upcoming free
webcast called Design Patterns in PHP, the schedule for the Fall Zend conference,
security alerts for Moveable Type and phpBB, the release of Zend Platform 2,
XAMPP for Linux, the latest PEAR/PECL releases and much more!
</description>
</item>
<item>
<title>In Case You Missed It...The Week of September 19, 2006</title>
<description>This week Elizabeth brings us news of the release of PEAR 1.4,
Zend Studio 5 Beta, a security vulnerability with PHP-Nuke, the release of a
SimpleTest plugin for PHPEclipse, a patch for phpMyAdmin, the latest PEAR/PECL
releases and much, much more!</description>