abc123
03-27-2006, 05:03 PM
Hello all,
I have recently become fascinated by how spiders traverse the web and index pages for search sites like Google. I have done a bit of research and think that it could be done using PHP. It has probably already been done in PHP, but I'll continue anyway...
So what I'm thinking is:
Open up the page as a file.
Use some string functions to take out data that is useful. (Title, Keywords etc...)
Take out any links from the page for traversing more pages - If there is multi-threading in php then I would imagine spawning more crawlers would be the way.
Index the current page with the link and somehow have it organised by most relevant depending on the search being performed. (This part I find puzzling)
I'm also aware that some pages don't want web crawlers, so any headers that tell the crawler not to traverse the page will need to be checked for.
Does anyone have any experience with this type of thing or any tips? I'm just really doing this to learn a few more tricks of the trade. Does what I have written seem to fit with what needs to be done?
Also I am slightly worried about creating these spiders and then just causing problems on people's sites.
MarkR
03-27-2006, 05:41 PM
Hi, I've written a medium-large scale web spider. It's still an experimental prototype; over a few hours of runs it has fetched over 50,000 pages so far.
Open up the page as a file.
Well, you can use fopen(), but I've found it's not flexible enough. Its timeouts don't really have enough control, and its behaviour on redirects is a bit lame.
Therefore I wrote my own HTTP implementation (I normally advise against this for simple apps). This lets me use reasonable timeouts and keep tabs on exactly how it works. And it is correct.
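For anyone who does not want to write a full HTTP client, here is a minimal sketch of a middle ground, not the implementation described above. It assumes a PHP version whose http stream wrapper supports the `timeout`, `follow_location` and `max_redirects` context options. The function names, the default values and the user agent string are made up for illustration.

```php
<?php
// Build a stream context with explicit limits (hypothetical helper name).
function make_fetch_context(float $timeoutSeconds = 10.0, int $maxRedirects = 3)
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'timeout'         => $timeoutSeconds, // read timeout in seconds
            'follow_location' => 1,               // follow Location: redirects
            'max_redirects'   => $maxRedirects,   // stop after this many
            'ignore_errors'   => true,            // still return the body on 4xx/5xx
            'user_agent'      => 'ExampleSpider/0.1 (+http://example.com/bot)', // placeholder
        ],
    ]);
}

// Fetch a page body, or null on failure (hypothetical helper name).
function fetch_page(string $url, $context): ?string
{
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

A crawler loop would call `fetch_page($url, make_fetch_context())` and treat `null` as "try again later". This still gives less control than a hand-written client, for example over connect timeouts versus read timeouts.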
Use some string functions to take out data that is useful. (Title, Keywords etc...)
Using string functions is a really bad idea when PHP already has a perfectly good HTML parser (DOMDocument::loadHTML()).
I'm using the DOM parser, although there are some issues with encoding - the PHP DOM parser assumes a default encoding (not sure what, and I don't think it can be changed) in the absence of a meta http-equiv content-type element in the head.
This is a problem at the moment, as a lot of documents on the web rely on HTTP headers for the encoding, which my spider does not (currently) respect.
Internally I store everything as utf8 of course.
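To illustrate the DOM approach, here is a rough sketch that pulls the title, the meta keywords and the raw link targets out of a page. The function name `extract_page_data` and the shape of the returned array are assumptions for the example, and it does nothing about the encoding issue mentioned above.

```php
<?php
// Extract title, meta keywords and raw href values from an HTML string.
// Hypothetical helper; the return format is just for illustration.
function extract_page_data(string $html): array
{
    $data = ['title' => '', 'keywords' => [], 'links' => []];
    if ($html === '') {
        return $data; // loadHTML() rejects empty input
    }

    // Real-world HTML is messy: collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'keywords') {
            foreach (explode(',', $meta->getAttribute('content')) as $keyword) {
                $keyword = trim($keyword);
                if ($keyword !== '') {
                    $data['keywords'][] = $keyword;
                }
            }
        }
    }

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $data['links'][] = $anchor->getAttribute('href'); // still relative at this point
        }
    }

    return $data;
}
```

The links come back exactly as written in the page, so they still have to be resolved against the page's own URL before they can be queued.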
Take out any links from the page for traversing more pages -
That is actually much harder than you imagine. Using a DOM you can easily find the links, but how do you determine where they point?
PHP's parse_url() function is a bit lame; it throws errors on some valid types of URL instead of parsing them.
Moreover, a lot of pages use relative URLs. You need a function which can interpret relative URLs. You also need to figure out what schemes you support and take query strings into account. Links like
<a href="?blah=42">42 things</a>
need to work, as well as ../ and ../../ etc.
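As a rough idea of what such a function involves, here is a simplified sketch loosely following the reference resolution rules in RFC 3986. The names `resolve_url` and `remove_dot_segments` are made up for the example. It assumes the base URL is absolute with a host, ignores user:password parts, and drops fragments because they do not change which document is fetched.

```php
<?php
// Collapse "." and ".." segments in an absolute path (simplified).
function remove_dot_segments(string $path): string
{
    $out = [];
    $segments = explode('/', $path);
    $last = count($segments) - 1;
    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            if ($segment === '..' && count($out) > 1) {
                array_pop($out); // go up one level, but never above the root
            }
            if ($i === $last) {
                $out[] = '';     // keep the trailing slash
            }
            continue;
        }
        $out[] = $segment;
    }
    return implode('/', $out);
}

// Resolve $rel against the absolute URL $base; null if $base is unusable.
function resolve_url(string $base, string $rel): ?string
{
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return null;
    }
    $rel = trim($rel);
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $rel)) {
        return $rel;                       // already has a scheme
    }
    if (strpos($rel, '//') === 0) {
        return $b['scheme'] . ':' . $rel;  // scheme-relative link
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');

    $rel = explode('#', $rel, 2)[0];       // drop the fragment
    $parts = explode('?', $rel, 2);
    $relPath = $parts[0];
    $query = isset($parts[1]) ? '?' . $parts[1] : null;
    $basePath = $b['path'] ?? '/';

    if ($relPath === '') {
        // "?blah=42" or "#top": same path, possibly a new query string
        $path = $basePath;
        if ($query === null && isset($b['query'])) {
            $query = '?' . $b['query'];
        }
    } elseif ($relPath[0] === '/') {
        $path = $relPath;                  // absolute path on the same host
    } else {
        // relative path: replace everything after the last "/" of the base path
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;
    }
    return $root . remove_dot_segments($path) . ($query ?? '');
}
```

Deciding which schemes to follow (for instance skipping `mailto:` and `javascript:` links) is left to the caller, since links that already have a scheme are returned untouched.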
If there is multi-threading in php then I would imagine spawning more crawlers would be the way.
There isn't a standard thread library (might be one in PECL, I haven't investigated). The obvious options are:
- Creating your own SAPI that spawns a number of threads
- Running it in multiple processes instead of threads.
I'm currently running my test jobs with 6 processes concurrently. This loads my machine up pretty well, especially seeing as I haven't optimised the number of queries the spider does yet.
I store all the metadata in MySQL, including the queue of pages to index etc. Thus there are a lot of queries, to look up URLs in the database (to find out if we've already seen them, etc).
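The "have we already seen this URL" check can be sketched in memory like this. It is only a stand-in for the MySQL tables described above, and the class name `Frontier` and its methods are invented for the example. With several processes the same idea would need to live in the database, for instance as a unique index on a hash column, which is an assumption on my part rather than a description of the real schema.

```php
<?php
// In-memory crawl queue that refuses URLs it has already been given.
// Hypothetical class; a multi-process spider would keep this state in the database.
class Frontier
{
    private $seen = [];  // sha1(url) => true
    private $queue = []; // URLs waiting to be fetched, oldest first

    // Queue a URL unless it was seen before. Returns true if it was added.
    public function add(string $url): bool
    {
        $key = sha1($url); // fixed-length key, handy for an indexed column too
        if (isset($this->seen[$key])) {
            return false;
        }
        $this->seen[$key] = true;
        $this->queue[] = $url;
        return true;
    }

    // Take the oldest queued URL, or null when the queue is empty.
    public function next(): ?string
    {
        return array_shift($this->queue);
    }
}
```

Note that two different spellings of the same address (such as a differing fragment or letter case in the host) count as different URLs here, so links are best normalised before they are added.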
Also, multiprocessing means doing things in the right order and ensuring that there aren't database contention problems. MySQL's MyISAM table type has fast table-level locking, which is pretty good for serialising operations you want to be atomic (although of course, this has disadvantages too).
I found that using InnoDB was hopeless, because it deadlocks (and rolls back your transaction) far too much if you have lots of processes doing writes at the same time.
Index the current page with the link and somehow have it organised by most relevant depending on the search being performed. (This part I find puzzling)
I'm not currently indexing them by topic at all. That is not my plan - rather I plan to make the spider go preferentially after certain types of page. I'm really interested in gathering technical data rather than words.
I'm also aware that some pages don't want web crawlers, so any headers that tell the crawler not to traverse the page will need to be checked for.
Yes. The robots exclusion protocol tells you, in robots.txt, which directories, URI prefixes etc. the site wants you to stay out of.
My spider understands robots.txt, but doesn't take any notice of robots meta tags (yet).
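For a feel of what understanding robots.txt means in practice, here is a deliberately small sketch and not the parser from my spider. The function names are made up. It only handles `User-agent` and `Disallow` lines with plain prefix matching, and it merges the rules of every matching group, which is a simplification (proper crawlers pick the most specific group and also handle `Allow` and wildcards).

```php
<?php
// Collect the Disallow prefixes from every group that applies to $agent.
// Groups for "*" always apply. Hypothetical helper for illustration.
function robots_disallowed_prefixes(string $robotsTxt, string $agent): array
{
    $prefixes = [];
    $applies = false;      // does the current group apply to us?
    $inAgentLines = false; // still reading consecutive User-agent lines?
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if (!$inAgentLines) {
                $applies = false; // a new group starts here
            }
            $inAgentLines = true;
            if ($value === '*' || ($value !== '' && stripos($agent, $value) !== false)) {
                $applies = true;
            }
        } else {
            $inAgentLines = false;
            if ($field === 'disallow' && $applies && $value !== '') {
                $prefixes[] = $value; // an empty Disallow means "nothing is blocked"
            }
        }
    }
    return $prefixes;
}

// True if $path does not start with any disallowed prefix.
function robots_allows(array $prefixes, string $path): bool
{
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
```

The parsed prefixes are worth caching per host, so robots.txt is fetched once rather than before every page.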
Does anyone have any experience with this type of thing or any tips?
Yes. See above.
Also I am slightly worried about creating these spiders and then just causing problems on people's sites.
It's not really a big problem. If you simply make a rule that the spider must not visit the same site too often, it's fine. Because there are *soo* many other sites in the queue, all the processes just keep themselves busy with other sites until it's time to go back.
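A minimal sketch of that rule, assuming a fixed minimum delay per host. The class name `HostThrottle`, the `pickNext` method and the 30 second figure in the usage note are all invented for the example. The current time is passed in rather than read from the clock, which keeps the logic easy to check.

```php
<?php
// Pick the next URL whose host has not been visited too recently.
// Hypothetical class; the per-host state lives in memory here.
class HostThrottle
{
    private $minDelay;          // seconds to wait between hits on one host
    private $nextAllowed = [];  // host => earliest timestamp of the next visit

    public function __construct(int $minDelaySeconds)
    {
        $this->minDelay = $minDelaySeconds;
    }

    // Return the first queued URL that may be fetched at time $now, or null.
    // The caller is expected to remove the returned URL from its queue.
    public function pickNext(array $queue, int $now): ?string
    {
        foreach ($queue as $url) {
            $host = parse_url($url, PHP_URL_HOST);
            if (!is_string($host)) {
                continue; // skip anything without a usable host
            }
            $host = strtolower($host);
            if (($this->nextAllowed[$host] ?? 0) <= $now) {
                $this->nextAllowed[$host] = $now + $this->minDelay;
                return $url;
            }
        }
        return null; // every queued host is still cooling down
    }
}
```

With `new HostThrottle(30)` and a queue mixing many sites, `pickNext($queue, time())` skips over hosts that were hit in the last 30 seconds and hands back a URL from some other site instead.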
Mark
hutchic
03-27-2006, 06:37 PM
http://snoopy.sourceforge.net/
http://ca3.php.net/curl
abc123
03-28-2006, 02:44 AM
Thanks a lot for your feedback on this one.
MarkR - You've certainly given me a lot of things to think about and a good place to get started.
hutchic - I'm familiar with Curl but hadn't heard of this Snoopy thing. Will certainly have a good look at that one.
I'll be indexing pages in no time :)