Over at Developer.com I recently penned an article titled Implement Data Indexing and Search with Lucene and Solr, which introduced readers to the powerful Apache Lucene text search engine library. In my opinion, one of the most important takeaways of that article was the understanding that Lucene makes document-based text search possible but it is not itself a search application. An end user cannot simply plug into it and begin sifting through a pile of electronic documents!
Rather, you take advantage of Lucene either by writing custom Java code capable of indexing the desired documents and providing a search interface (as I demonstrated in the aforementioned article using Lucene's bundled demo), or by using one of the several Lucene implementations written in languages such as Perl, Python, or Ruby. You also have the option of looking towards one of the search platform implementations built atop Lucene, such as Solr (also introduced in the Developer.com article).
In this article I'll show you how to undertake the former approach using PHP's most prominent Lucene implementation, which also happens to be part of the Zend Framework: the Zend_Search_Lucene component.
The Zend_Search_Lucene component is a PHP 5-based Lucene implementation capable of indexing and searching several document types, among them HTML, Excel 2007, PowerPoint 2007, Word 2007, and XML. Additionally, you can use this component to supplant MySQL's useful but limited full-text search feature. As with all Zend Framework components, you can either tightly integrate Zend_Search_Lucene into your Zend Framework applications or use it on its own within any PHP application. I've documented the latter approach in the PHPBuilder.com article, Running PHP and Zend Framework Scripts from the Command Line. For the purposes of this demonstration I'll show you how to integrate the component into a Zend Framework application.
PHP + Lucene: Indexing a Database
Suppose you created an online service for job seekers, allowing them to generate an appealing downloadable resume simply by entering their contact information, education and employment history into a Web form. In addition to resume generation, you enter the job seeker's information into a database, which employers can search in return for a small monthly fee.
You'd like to tout employers' ability to perform power searches that allow them to comb over every conceivable characteristic of job seekers' resumes, including being able to retrieve only resumes containing a specific term or phrase, and those that specifically do not contain a particular term or phrase. Sounds like a job made for Lucene, thanks to its powerful query parser syntax (see the Zend_Search_Lucene documentation for a list of minor differences from the original Lucene implementation)!
In order to make a newly uploaded resume immediately available to prospective employers, we'll index each job seeker's information at the time it's added to the database. To do so, we'll index the searchable data and an associated identifier that links that data to its database record.
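// ... insert resume data ($name, $education, $experience) into database
// Retrieve the last insert ID
$id = $db->lastInsertId();
$index = Zend_Search_Lucene::open('/var/www/dev.example.com/lucene-index');
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('dbid', $id));
$doc->addField(Zend_Search_Lucene_Field::Text('name', $name));
$doc->addField(Zend_Search_Lucene_Field::UnStored('education', $education));
$doc->addField(Zend_Search_Lucene_Field::UnStored('experience', $experience));
$index->addDocument($doc);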
This snippet begins by opening the Lucene index using the static open() method. Note that open() only works on an index that already exists; the Zend_Search_Lucene component requires you to call a separate static method named create() in order to create the index. Therefore, you'll want to run create() once before opening the index for the first time.
Additionally, this snippet adds four fields to the index:
dbid -- represents the primary key associated with the record just added to the database
name -- contains the job seeker's name
education -- contains the job seeker's provided education history
experience -- contains the job seeker's supplied experience
Each of these fields is assigned a specific field type. The UnIndexed type identifies a non-searchable field that is returned with search results. The Text type identifies a field that is both searchable and returned with the search results. The UnStored type identifies data that should be tokenized and indexed, but not stored within the index. This is useful when you're using Zend_Search_Lucene in conjunction with a database, because the full text can be retrieved from the database record rather than duplicated in the index.
Still other field types exist; be sure to consult the documentation and Zend_Search_Lucene source code for more details.
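Because open() expects the index to exist already, one way to avoid running create() by hand is to fall back to it when open() fails. The following is only a minimal sketch: openOrCreateIndex() is a hypothetical helper name, and the code assumes Zend Framework 1 is on the include_path and that open() throws a Zend_Search_Lucene_Exception when no index is found at the given path.

```php
<?php
// Assumes Zend Framework 1 is on the include_path
require_once 'Zend/Search/Lucene.php';

/**
 * Open the index at $path, creating it first if it cannot be opened.
 * openOrCreateIndex() is a hypothetical helper, not part of the component.
 */
function openOrCreateIndex($path)
{
    try {
        return Zend_Search_Lucene::open($path);
    } catch (Zend_Search_Lucene_Exception $e) {
        // No usable index at $path yet, so create an empty one
        return Zend_Search_Lucene::create($path);
    }
}
```

With this in place, the open() call in the indexing snippet could be replaced with openOrCreateIndex('/var/www/dev.example.com/lucene-index').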
PHP + Lucene: Searching Documents
With several resumes indexed, let's turn our attention to the search interface. Starting with a simple scenario, suppose an employer wanted to search for all job seekers who mentioned Dell somewhere within their resumes:
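A minimal sketch of such a search follows. The renderResults() helper and the /resumes/view/id/ URL pattern are hypothetical names chosen for illustration; dbid and name are the stored fields described earlier, and $index is assumed to be the object returned by Zend_Search_Lucene::open().

```php
<?php
/**
 * Run $queryString against $index and return one HTML link per hit.
 * renderResults() and the /resumes/view/id/ route are hypothetical names.
 */
function renderResults($index, $queryString)
{
    $html = '';
    foreach ($index->find($queryString) as $hit) {
        // dbid and name are stored fields, so they can be read from each hit;
        // score is the relevance value assigned to the hit
        $html .= sprintf(
            '<a href="/resumes/view/id/%d">%s</a> (score: %.2f)<br />' . "\n",
            $hit->dbid,
            htmlspecialchars($hit->name),
            $hit->score
        );
    }
    return $html;
}

// Usage, assuming the index built earlier:
// $index = Zend_Search_Lucene::open('/var/www/dev.example.com/lucene-index');
// echo renderResults($index, 'Dell');
```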
Each returned search result is output to the browser along with a generated URL, which presumably points to a controller action that uses the provided primary key to retrieve more information about the job seeker. A score indicating the quality of the match is also included. You can optionally limit the number of results returned using the static setResultSetLimit() method (see the documentation for more information).
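Limiting the result set might look like the following sketch, which assumes Zend Framework 1 is on the include_path. The limit of 10 is an arbitrary example value. According to the documentation, the limit yields the first N hits found rather than the best N, so it is best treated as a performance safeguard rather than a ranking tool.

```php
<?php
// Assumes Zend Framework 1 is on the include_path
require_once 'Zend/Search/Lucene.php';

// Cap every later find() call at 10 hits; a limit of 0 (the default) means no limit
Zend_Search_Lucene::setResultSetLimit(10);

// Subsequent searches, e.g. $index->find('Dell'), now return at most 10 hits
```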
You can narrow the search to a specific indexed field by modifying the search query to look something like this:
$query = 'name: "Gilmore"';
It's also possible to build a field-specific query programmatically using the query construction API:
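// Terms built through the API bypass the analyzer, so use the lowercased
// form that the default case-insensitive analyzer stores in the index
$term = new Zend_Search_Lucene_Index_Term('ohio', 'education');
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$results = $index->find($query);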
You can even combine terms to produce more complex queries. For instance, the following example will return only job seekers who mention the term Ohio in their education with the qualifier that Michigan is not also referenced in the same field:
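One way to express this is with a multi-term query, sketched below. The requireWithout() helper is a hypothetical name, and the code assumes Zend Framework 1 is on the include_path and that the index was built with the default case-insensitive analyzer (hence the lowercasing of API-built terms).

```php
<?php
// Assumes Zend Framework 1 is on the include_path
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Index/Term.php';
require_once 'Zend/Search/Lucene/Search/Query/MultiTerm.php';

/**
 * Build a query that requires $required and prohibits $prohibited in $field.
 * requireWithout() is a hypothetical helper name.
 */
function requireWithout($field, $required, $prohibited)
{
    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();
    // Second argument to addTerm(): true = required, false = prohibited.
    // Terms are lowercased because API-built terms are not analyzed.
    $query->addTerm(new Zend_Search_Lucene_Index_Term(strtolower($required), $field), true);
    $query->addTerm(new Zend_Search_Lucene_Index_Term(strtolower($prohibited), $field), false);
    return $query;
}

$query = requireWithout('education', 'Ohio', 'Michigan');
// $results = $index->find($query);
```

The same search can also be written as a query string, using + for a required term and - for a prohibited one: $index->find('+education:Ohio -education:Michigan').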
The Zend_Search_Lucene component isn't without its limitations; for instance, only indexes of up to 2GB are currently supported on 32-bit operating systems. However, if you're not intent on building the next Monster.com, chances are it's going to suit your needs quite nicely!