Using sockets in PHP: Get articles from Usenet

By Armel Fauveau
PHP can open sockets on remote or local hosts. Here is a hands-on example of using such a socket: connecting to a Usenet news server, talking to this server, and downloading some articles from a specific newsgroup.

Opening a socket in PHP

Sockets are opened using fsockopen(). This function is available in both PHP3 and PHP4. It has the following prototype:

<?php

int fsockopen(string hostname, int port [, int errno [, string errstr [, double timeout]]])
?>
For the Internet domain, it opens a TCP socket connection to hostname on port port. In this case, hostname may be either a fully qualified domain name or an IP address. For UDP connections, you need to specify the protocol explicitly: udp://hostname. For the Unix domain, hostname is used as the path to the socket, and port must be set to 0. The optional timeout sets a limit, in seconds, on the connect system call.
More information about fsockopen(): http://www.php.net/manual/function.fsockopen.php
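To see the prototype in action without depending on a remote host, here is a minimal sketch. It assumes a reasonably recent PHP and uses a throwaway listener on the loopback interface (created with stream_socket_server(), which is not part of the original article) as a stand-in for a real server. The variable names are purely illustrative.

```php
<?php
// Hypothetical demo: a throwaway listener on the loopback interface stands in
// for a real remote host, so the sketch runs without a news server.
$server = stream_socket_server("tcp://127.0.0.1:0", $srvErrno, $srvErrstr);
if ($server === false) {
    exit("Could not start the demo listener: $srvErrstr\n");
}

// Port 0 asks the system for any free port; read back the one it picked.
$address = stream_socket_get_name($server, false);
$port    = (int) substr($address, strrpos($address, ":") + 1);

// The call described above: host, port, error outputs, timeout in seconds.
$socket = fsockopen("127.0.0.1", $port, $errno, $errstr, 5.0);
if ($socket === false) {
    exit("Connection failed ($errno): $errstr\n");
}

// Accept the pending connection on the listener side and send one line.
$peer = stream_socket_accept($server, 5.0);
if ($peer === false) {
    exit("The demo listener did not receive the connection\n");
}
fwrite($peer, "hello from the listener\r\n");

// The handle returned by fsockopen() works with fgets(), fputs() and fclose().
$line = fgets($socket, 1024);
echo "Received: ", rtrim($line), "\n";

fclose($peer);
fclose($socket);
fclose($server);
```

On failure, fsockopen() returns false and fills in $errno and $errstr, which is why the sketch tests the return value before using the handle.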

Network News Transfer Protocol

Accessing a Usenet news server requires a specific protocol called NNTP, which stands for Network News Transfer Protocol.
This protocol is described in detail in RFC977 (Request For Comments number 977), which is available at: http://www.w3.org/Protocols/rfc977/rfc977.html
This document describes precisely how to connect to an NNTP server and how to converse with it using the available commands.
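Every reply from an NNTP server starts with a three-digit status code, and RFC977 groups these codes by their first digit. As a rough sketch (the function names nntp_parse_status() and nntp_status_class() are made up for this illustration, and the code assumes PHP 7 or later), a status line could be split and classified like this:

```php
<?php
// Hypothetical helper: split one NNTP status line into its code and its text.
function nntp_parse_status(string $line): array
{
    $line = rtrim($line, "\r\n");
    if (!preg_match('/^(\d{3})(?: (.*))?$/', $line, $m)) {
        throw new InvalidArgumentException("Not an NNTP status line: $line");
    }
    return ['code' => (int) $m[1], 'text' => $m[2] ?? ''];
}

// Hypothetical helper: RFC977 groups status codes by their first digit.
function nntp_status_class(int $code): string
{
    switch (intdiv($code, 100)) {
        case 1:  return 'informative message';
        case 2:  return 'command ok';
        case 3:  return 'command ok so far, send the rest';
        case 4:  return 'command was correct but could not be performed';
        case 5:  return 'command unimplemented, incorrect, or serious error';
        default: return 'unknown';
    }
}

$status = nntp_parse_status("205 .\r\n");
echo $status['code'], " => ", nntp_status_class($status['code']), "\n";
```

Relying on the numeric code rather than on the text after it is the safer choice, since the wording of the text varies from one server to another.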

Connecting

Connecting to the NNTP server requires knowing its hostname (or IP address) and the port it is listening on. You should include a timeout so that an unsuccessful attempt at connecting does not "freeze" the application.

<?php

$cfgServer  = "your.news.host";
$cfgPort    = 119;
$cfgTimeOut = 10;

// open a socket
if (!$cfgTimeOut) {
    // without timeout
    $usenet_handle = fsockopen($cfgServer, $cfgPort);
} else {
    // with timeout; $errno and $errstr receive the error details on failure
    $usenet_handle = fsockopen($cfgServer, $cfgPort, $errno, $errstr, $cfgTimeOut);
}

if (!$usenet_handle) {
    echo "Connection failed\n";
    exit();
} else {
    echo "Connected\n";
    // read the greeting line sent by the server
    $tmp = fgets($usenet_handle, 1024);
}

?>
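The first line a server sends after the connection is a greeting, and its status code already tells something useful: in RFC977, 200 means the server is ready and posting is allowed, while 201 means it is ready but posting is not allowed. A small sketch of such a check follows; nntp_greeting_allows_posting() is a hypothetical name, not an existing PHP function.

```php
<?php
// Hypothetical helper: interpret the greeting line of an NNTP server.
// Returns true when posting is allowed (200), false when the server is
// read-only (201), and throws for any other reply.
function nntp_greeting_allows_posting(string $greeting): bool
{
    $code = substr($greeting, 0, 3);
    if ($code === "200") {
        return true;
    }
    if ($code === "201") {
        return false;
    }
    throw new RuntimeException("Server not ready: " . rtrim($greeting));
}

// Sample greeting with a placeholder host name.
$greeting = "200 my.news.host InterNetNews NNRP server ready (posting ok).\r\n";
echo nntp_greeting_allows_posting($greeting) ? "Posting allowed\n" : "Read-only\n";
```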

Talking to the Server

We are now connected to the server and can talk to it through the previously opened socket. Let us say we want to get the 10 latest articles from some newsgroup. RFC977 specifies that the first step is to select the right newsgroup with the GROUP command:
GROUP ggg
The required parameter ggg is the name of the newsgroup to be selected (e.g. "net.news"). A list of valid newsgroups may be obtained from the LIST command. The successful selection response will return the article numbers of the first and last articles in the group, and an estimate of the number of articles on file in the group.
Example:
chrome:~$ telnet my.news.host 119
Trying aa.bb.cc.dd...
Connected to my.news.host.
Escape character is '^]'.
200 my.news.host InterNetNews NNRP server INN 2.2.2 13-Dec-1999 ready (posting ok).
GROUP alt.test
211 232 222996 223235 alt.test
quit
205 .
After receiving the command "GROUP alt.test", the news server answered "211 232 222996 223235 alt.test". 211 is an RFC-defined code (basically saying the command was successfully executed; check the RFC for more details). The server also says it currently has 232 articles, numbered from 222996 for the oldest through 223235 for the latest. These are called article numbers. Now, let us do the count: the range 222996 through 223235 covers 240 article numbers, not 232. Bearing in mind that the count is only an estimate, the eight missing articles were removed from the server one way or another, either cancelled by their legitimate author (yes, it is possible and easy to do!) or deleted after a report of abuse, for example.
Be careful though: the server might require authentication before selecting the newsgroup, depending on whether it is a public or private server. It could also let anybody retrieve articles but require authentication to publish an article.
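The reply to GROUP from the telnet session can be taken apart field by field. The sketch below assumes the "211 count first last group" layout described above. nntp_parse_group() is a hypothetical helper, and the arithmetic counts both ends of the article number range.

```php
<?php
// Hypothetical helper: parse a "211 count first last group" reply.
function nntp_parse_group(string $line): array
{
    $parts = explode(" ", rtrim($line, "\r\n"));
    if (count($parts) < 5 || $parts[0] !== "211") {
        throw new RuntimeException("Unexpected GROUP reply: " . rtrim($line));
    }
    return [
        'count' => (int) $parts[1],
        'first' => (int) $parts[2],
        'last'  => (int) $parts[3],
        'group' => $parts[4],
    ];
}

$g    = nntp_parse_group("211 232 222996 223235 alt.test\r\n");
$span = $g['last'] - $g['first'] + 1;   // article numbers covered by the range
$gaps = $span - $g['count'];            // numbers with no article behind them
echo "{$g['group']}: {$g['count']} articles in a range of $span numbers, $gaps missing\n";
```

With the reply shown in the telnet session, this prints "alt.test: 232 articles in a range of 240 numbers, 8 missing".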

<?php

//$cfgUser      = "xxxxxx";
//$cfgPasswd    = "yyyyyy";
$cfgNewsGroup = "alt.php";

// identification required on private server
if (!empty($cfgUser)) {
    // NNTP commands end with CR-LF
    fputs($usenet_handle, "AUTHINFO USER ".$cfgUser."\r\n");
    $tmp = fgets($usenet_handle, 1024);

    fputs($usenet_handle, "AUTHINFO PASS ".$cfgPasswd."\r\n");
    $tmp = fgets($usenet_handle, 1024);

    // check error: only the code is compared, the text varies between servers
    if (substr($tmp, 0, 3) != "281") {
        echo "502 Authentication error\n";
        exit();
    }
}

// select newsgroup
fputs($usenet_handle, "GROUP ".$cfgNewsGroup."\r\n");
$tmp = fgets($usenet_handle, 1024);

// 480: the server requires authentication for this command
if (substr($tmp, 0, 3) == "480") {
    echo "$tmp\n";
    exit();
}

// reply layout: 211 count first last group
$info  = explode(" ", $tmp);
$first = $info[2];
$last  = $info[3];

print "First : $first\n";
print "Last : $last\n";

?>
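AUTHINFO is not defined in RFC977 itself; it is a widely implemented extension (documented in RFC 2980), and servers word their replies differently, so only the three-digit code can be relied on. As a sketch under that assumption, the replies could be classified as follows. nntp_auth_outcome() and its return labels are invented for this example.

```php
<?php
// Hypothetical helper: classify a reply to AUTHINFO USER or AUTHINFO PASS
// by its status code only.
function nntp_auth_outcome(string $reply): string
{
    switch (substr($reply, 0, 3)) {
        case "281": return "accepted";        // authentication accepted
        case "381": return "need-password";   // more information required
        case "480": return "auth-required";   // authentication required first
        case "482":                           // authentication rejected
        case "502": return "rejected";        // no permission
        default:    return "unexpected";
    }
}

echo nntp_auth_outcome("381 PASS required\r\n"), "\n";
echo nntp_auth_outcome("281 Ok\r\n"), "\n";
```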

Getting some articles

Now that we have the article number of the latest article, it is easy to get the latest ten articles. RFC977 says the ARTICLE command can be used with either the article number or the Message ID.
Be careful here. The article number is different from the Message ID: every news server assigns its own article numbers, so the same article will not have the same number on two different news servers, whereas the Message ID, included in the article's header, is unique.
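To make the two forms of the command concrete, here is a small sketch that builds an ARTICLE command from either kind of reference. The helper nntp_article_command() and the example Message ID are hypothetical. The only assumption taken from RFC977 is that a Message ID is written between angle brackets.

```php
<?php
// Hypothetical helper: build an ARTICLE command from an article number
// (local to one server) or from a Message ID (the same on every server).
function nntp_article_command($ref): string
{
    $ref = trim((string) $ref);
    if (preg_match('/^\d+$/', $ref)) {
        return "ARTICLE $ref\r\n";              // article number
    }
    if ($ref === '' || preg_match('/\s/', $ref)) {
        throw new InvalidArgumentException("Invalid article reference");
    }
    if ($ref[0] !== '<') {
        $ref = "<$ref>";                        // add the angle brackets
    }
    return "ARTICLE $ref\r\n";                  // Message ID
}

echo nntp_article_command(223235);
echo nntp_article_command("<demo.1234@example.invalid>");
```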

<?php

$cfgLimit = 10;

// download the last articles, without going below the first article number
$boucle = max((int) $first, (int) $last - $cfgLimit + 1);

while ($boucle <= $last) {

    set_time_limit(0);

    fputs($usenet_handle, "ARTICLE $boucle\r\n");

    $article = "";
    $tmp = fgets($usenet_handle, 4096);
    if (substr($tmp, 0, 3) != "220") {
        echo "+----------------------+\n";
        echo "Error on article $boucle\n";
        echo "+----------------------+\n";
    }
    else {
        // read until the line holding a single dot, or until the connection drops
        while (($tmp = fgets($usenet_handle, 4096)) !== false && $tmp != ".\r\n") {
            $article = $article.$tmp;
        }

        echo "+----------------------+\n";
        echo "Article $boucle\n";
        echo "+----------------------+\n";
        echo "$article\n";
    }

    $boucle++;
}

?>
We just retrieved the ten latest articles available for this newsgroup on this server. It is also possible to get only the article's header, using the HEAD command, or only the text, using the BODY command.
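ARTICLE, HEAD and BODY all answer with a status line followed by text that ends with a line holding a single dot. RFC977 also says that a text line starting with a dot is sent with that dot doubled, so the reader has to remove the extra one. The sketch below shows one way to do this. nntp_read_text() is a hypothetical helper, and an in-memory stream with a canned BODY reply stands in for the socket so that the example runs on its own.

```php
<?php
// Hypothetical helper: read a multi-line NNTP text response from a stream,
// up to (and not including) the terminating line that holds a single dot.
function nntp_read_text($stream): string
{
    $text = "";
    while (($line = fgets($stream)) !== false) {
        if ($line === ".\r\n") {
            return $text;                 // lone dot: end of the response
        }
        if ($line[0] === ".") {
            $line = substr($line, 1);     // undo the doubled leading dot
        }
        $text .= $line;
    }
    throw new RuntimeException("Connection closed before the terminating dot");
}

// In-memory stream standing in for the socket, holding a made-up BODY reply.
$fake = fopen("php://memory", "r+");
fwrite($fake, "222 223235 <demo@example.invalid> body follows\r\n"
            . "First line\r\n"
            . "..a line that really starts with one dot\r\n"
            . ".\r\n");
rewind($fake);

$status = fgets($fake);                   // status line: 222 for BODY
$body   = nntp_read_text($fake);
fclose($fake);

echo $body;
```

The same function would serve for HEAD (status code 221) and ARTICLE (status code 220), since only the status line differs.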

Closing the connection

To end the session with the NNTP server, just close the socket using fclose() as you would close a file.

<?php

// close connection
fclose($usenet_handle);

?>
More information about fclose(): http://www.php.net/manual/function.fclose.php
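Closing the socket is enough, but a client can also tell the server it is leaving first: the telnet session earlier did this with the QUIT command, which the server answered with code 205. Here is a sketch of that variant. nntp_quit() is a hypothetical helper, and a connected socket pair created with stream_socket_pair() plays both ends so the example does not need a real server.

```php
<?php
// Hypothetical helper: send QUIT, read the reply, then close the socket.
// Returns true when the server acknowledged with code 205.
function nntp_quit($handle): bool
{
    fwrite($handle, "QUIT\r\n");
    $reply = fgets($handle, 1024);
    fclose($handle);
    return $reply !== false && substr($reply, 0, 3) === "205";
}

// Demo: two connected sockets in one process; one acts as the server.
$domain = (DIRECTORY_SEPARATOR === "\\") ? STREAM_PF_INET : STREAM_PF_UNIX;
$pair   = stream_socket_pair($domain, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
if ($pair === false) {
    exit("Could not create the demo socket pair\n");
}
[$client, $server] = $pair;

fwrite($server, "205 .\r\n");        // canned server reply, queued in advance
$ok   = nntp_quit($client);          // client says goodbye and closes
$sent = fgets($server, 1024);        // what the server end received
fclose($server);

echo $ok ? "Server said goodbye\n" : "No 205 reply\n";
```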

Conclusion

We just saw how to open, use, and then close a socket in a specific context: connecting to an NNTP server and getting back some newsgroup articles. Posting articles to an NNTP server using the POST command is not much more complicated.
The next step is therefore to code an HTML news client (and get rid of Netscape :p).
It is also very easy to store the articles and index them with a search engine such as ht://dig (http://www.htdig.org/), which gives you a web-based application for keyword searches across newsgroups.
An example of such an application is available at http://www.phpindex.com/ng/
-- Armel