Introduction:

I started working with PHP six months ago. I read many articles on the Internet that gave me a better understanding of PHP. I then started developing software for “Online Journals” that can search the contents of documents. You can find articles on devarticles.com that perform keyword, title, and author searches. This article gives you a brief idea of document-based search.
What is Document Search?
In a dynamic document search, every word in the document is parsed (read) and matched against the search words. Results are displayed based on the matches found.
Reading every word of every article and matching it against the search words across thousands, or even hundreds of thousands, of documents is a very expensive task. Also, by default PHP limits a script to 30 seconds of execution time (the max_execution_time setting).
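To make the cost concrete, here is a minimal sketch of the naive approach, in which every document is scanned in full for every query. The function name naive_search and the sample documents are hypothetical and only serve as an illustration.

```php
<?php
// Hypothetical illustration: scan every document for every search word.
// The work grows with (number of documents) x (document length) on each query.
function naive_search($documents, $searchWords) {
    $matches = array();
    foreach ($documents as $id => $text) {
        foreach ($searchWords as $word) {
            // stripos() is a case-insensitive substring search
            if (stripos($text, $word) !== false) {
                $matches[] = $id;
                break; // one hit is enough to list this document
            }
        }
    }
    return $matches;
}

// Assumed sample data: document id => text
$docs = array(
    1 => "Paging through result sets in PHP",
    2 => "Installing Apache on Linux",
);
$found = naive_search($docs, array("php"));
```

The rest of this article avoids this repeated scanning by indexing the words once, at upload time.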

Prerequisites:

To understand this article, you should have a fair knowledge of PHP. To run the examples on your machine, you need Apache, PHP, and MySQL installed and configured. I used PHP version 4.3.1 and MySQL 2.2.3.

Building Database:

The database consists of three tables: a content table, a keyword table, and a link table. The content table holds each article’s title and abstract. The keyword table holds the keywords, and its keyword field is indexed. The link table holds pairs of keyword id and content id.
The SQL statements for creating these three tables are shown below.

Content Table:


CREATE TABLE content ( 
contid mediumint(9) NOT NULL auto_increment, 
title text,
abstract longtext, 
PRIMARY KEY (contid) ) TYPE=MyISAM; 

Keyword Table:


CREATE TABLE keytable (
 keyid mediumint NOT NULL auto_increment, 
keyword varchar(100) default NULL, 
PRIMARY KEY (keyid), 
KEY keyword (keyword) ) TYPE=MyISAM; 

Link Table:


CREATE TABLE link ( 
keyid mediumint NOT NULL, 
contid mediumint NOT NULL)
TYPE=MyISAM;

Preparing Database:

An input interface with an HTML form is created to enter the title and the document text. After the form is filled in and submitted, the title and the abstract are stored in the content table, and the newly generated content id is kept temporarily in a variable. In the next step, an ‘Upload Engine’ parses each word in the abstract and processes the whole text. It removes common words such as is, was, and, if, so, else, and then, and stores each remaining word in the wordmap array, so that every word has only one entry there.
For every word in the wordmap array, the keyword table is searched for a match. If there is a match, the existing key id and the content id generated earlier are stored in the link table. Otherwise, the new keyword is inserted into the keyword table, and the link table is updated with the newly generated key id and the content id. This completes the preparation of our database.
The code snippets given below explain every step of the program.
Searching the keyword table for every word is a long process that reduces the efficiency of the program. To avoid this, all the keywords in the keyword table are loaded into an associative array, $allWords. PHP implements associative arrays as hash tables, so looking up a word by its key is very fast. Here is the function.

<?php
function LoadCurrentWords() {
    global $allWords;

    $result = mysql_query("select keyid, keyword from keytable") or die("Error in executing mysql query");

    // map every known keyword to its key id
    while ($row = mysql_fetch_array($result)) {
        $allWords[$row['keyword']] = $row['keyid'];
    }
}
?>
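The lookup that $allWords makes possible can be tried without a database. The sketch below is only an illustration: the rows are hard-coded sample data standing in for the keytable result set, and find_key_id is a hypothetical helper name.

```php
<?php
// Sample rows standing in for "select keyid, keyword from keytable" (assumed data).
$rows = array(
    array('keyid' => 1, 'keyword' => 'paging'),
    array('keyid' => 2, 'keyword' => 'search'),
);

// Build the keyword => keyid map once, in the same way as LoadCurrentWords().
$allWords = array();
foreach ($rows as $row) {
    $allWords[$row['keyword']] = $row['keyid'];
}

// Hypothetical helper: return the key id for a word, or null if the word is new.
function find_key_id($allWords, $word) {
    return isset($allWords[$word]) ? $allWords[$word] : null;
}
```

Each lookup is a single array access, so no query is needed per word during an upload.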

Common Words:

$COMMON_WORDS is an associative array whose keys are words that are commonly used in the English language. These words have to be removed while parsing the file.
$COMMON_WORDS = array("a" => 1, "as" => 1);
You can add as many common words as you like. See the source code for the full list of common words.
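If you prefer to maintain the common words as a plain list, one possible approach is to flip the list into a lookup map. This is a sketch under assumptions: the word list is a small sample, and is_common_word is a hypothetical helper name.

```php
<?php
// Assumed sample list; the full list is in the article's source code.
$commonList = array("a", "as", "is", "was", "and");

// array_flip() turns the values into keys, so each word can be tested with isset().
// The flipped values are the old positions (0, 1, 2, ...), so always test with
// isset() here and not with a truth check, because position 0 is falsy.
$COMMON_WORDS = array_flip($commonList);

// Hypothetical helper: case-insensitive test against the map.
function is_common_word($commonWords, $word) {
    return isset($commonWords[strtolower($word)]);
}
```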

ExtractWords() Function:

This function filters words by allowing only alphabetic characters. To implement this, I used a technique called a state machine to filter the characters.
Alphabetic characters correspond to STATE1 and all other characters (numeric and special characters) to STATE0. Initially the machine is in STATE0. When it encounters an alphabetic character while parsing, it switches to STATE1 and starts a new word; otherwise it remains in STATE0. In STATE1, each further alphabetic character is appended to the current word, and the first non-alphabetic character ends the word and returns the machine to STATE0. As a result we get words made up of only alphabetic characters.

<?php
function ExtractWords($text) {
    $STATE0 = 0;  // Numeric / other characters
    $STATE1 = 1;  // Alpha characters
    $state = $STATE0;

    $wordList = array();
    $curWord = "";

    for ($i = 0; $i < strlen($text); ++$i) {
        $ch = $text[$i];
        $isAlpha = ctype_alpha($ch);

        if ($state == $STATE0) {
            if ($isAlpha) {
                // an alpha character starts a new word
                $curWord = $ch;
                $state = $STATE1;
            }
        }
        else if ($state == $STATE1) {
            if ($isAlpha) {
                $curWord .= $ch;
            }
            else {
                // a non-alpha character ends the current word
                $wordList[] = strtolower($curWord);
                $state = $STATE0;
            }
        }
    }

    // store the last word if the text ended in the middle of one
    if ($state == $STATE1) {
        $wordList[] = strtolower($curWord);
    }

    return $wordList;
}
?>
As a result, the calling function receives an array containing the list of words.
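For comparison, the same kind of word list can be produced with a regular expression. This is an alternative sketch, not the state machine described above, and extract_words_regex is a hypothetical name.

```php
<?php
// Alternative sketch (not the article's state machine): let a regular
// expression find every run of ASCII letters, then lower-case the matches.
function extract_words_regex($text) {
    preg_match_all('/[A-Za-z]+/', $text, $m);
    return array_map('strtolower', $m[0]);
}

// Digits, spaces, and punctuation all act as word separators.
$words = extract_words_regex("PHP 4 is fun, isn't it?");
```

Note that the apostrophe splits "isn't" into "isn" and "t", which is also what a letters-only state machine does.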
FilterCommonAndDuplicateWords() Function:
This function is called after the ExtractWords() function. It goes through the extracted words and removes common words such as ‘a’, ‘is’, ‘was’, and ‘and’. The remaining words are treated as valid; duplicates among them are removed, and the result is stored in an associative array, $wordMap, which is returned to the calling function.

<?php
function FilterCommonAndDuplicateWords($wordList) {
    global $COMMON_WORDS;
    global $MAX_WORD_LENGTH;

    $wordMap = array();

    foreach ($wordList as $word) {
        $len = strlen($word);
        // keep words longer than one character and shorter than the limit
        if (($len > 1) && ($len < $MAX_WORD_LENGTH)) {
            // isset() avoids notices for keys that do not exist yet
            if (!isset($wordMap[$word])) {
                if (!isset($COMMON_WORDS[$word])) {
                    $wordMap[$word] = 1;
                }
            }
        }
    }

    return $wordMap;
}
?>
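The filtering step can also be written without globals, which makes it easy to try on its own. The sketch below assumes a hypothetical signature in which the common-word map and the length limit are passed as arguments; the sample words and the limit of 50 are made up.

```php
<?php
// Sketch: same filtering idea, but the common-word map and the length limit
// are passed in as arguments (hypothetical signature) instead of globals.
function filter_words($wordList, $commonWords, $maxLen) {
    $wordMap = array();
    foreach ($wordList as $word) {
        $len = strlen($word);
        if ($len > 1 && $len < $maxLen && !isset($commonWords[$word])) {
            // assigning the same key twice keeps a single entry per word
            $wordMap[$word] = 1;
        }
    }
    return $wordMap;
}

$map = filter_words(
    array("the", "search", "is", "a", "search", "engine"),
    array("the" => 1, "is" => 1),
    50
);
```

Here "the" and "is" are dropped as common words, "a" is dropped for being one character long, and the repeated "search" is stored only once.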

ProcessForm() Function:

This is the core part of the upload program. The function first calls ExtractWords() and FilterCommonAndDuplicateWords() to obtain the filtered word list. It then inserts the title and abstract into the content table and stores the newly generated content id in $contentId. Finally, it updates the keyword and link tables.
For every word in the word list, if the word already exists in the keyword table, the function inserts its key id together with the content id into the link table. Otherwise, it inserts the new word into the keyword table, stores the newly generated key id in $keyId, and then inserts that key id and the content id into the link table.

<?php
function ProcessForm($title, $body) {
    global $allWords;

    $tempWordList = ExtractWords($body);
    $wordList = FilterCommonAndDuplicateWords($tempWordList);

    // insert into content
    mysql_query(sprintf("INSERT INTO content (title, abstract) VALUES ('%s', '%s')",
        mysql_escape_string($title), mysql_escape_string($body)));

    // store the newly generated content id in $contentId
    $contentId = mysql_insert_id();

    // insert all the new words and links
    foreach ($wordList as $word => $val) {
        $keyId = "";
        if (!isset($allWords[$word])) {
            // new keyword: insert it and remember its generated key id
            mysql_query(sprintf("INSERT INTO keytable ( keyword ) VALUES ( '%s' )",
                mysql_escape_string($word)));

            $keyId = mysql_insert_id();
            $allWords[$word] = $keyId;
        }
        else {
            $keyId = $allWords[$word];
        }

        // insert the link
        mysql_query(sprintf("INSERT INTO link (keyid, contid) VALUES ( %d, %d )", $keyId, $contentId));
    }
    //End of Processing Form.
}
?>
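To see what ends up in the keyword and link tables, the same bookkeeping can be simulated with plain arrays. This is a sketch under assumptions: the arrays $keytable and $link stand in for the real tables, index_document is a hypothetical helper, and the auto_increment behaviour is imitated by counting entries.

```php
<?php
// In-memory stand-ins for the keytable and link tables (assumed structures).
$keytable = array();   // keyword => keyid
$link     = array();   // list of array(keyid, contid) rows

// Hypothetical helper mirroring the keyword/link part of ProcessForm().
function index_document($contentId, $wordMap, &$keytable, &$link) {
    foreach ($wordMap as $word => $unused) {
        if (!isset($keytable[$word])) {
            // new keyword: imitate auto_increment with the next free id
            $keytable[$word] = count($keytable) + 1;
        }
        // one link row per (keyword, document) pair
        $link[] = array($keytable[$word], $contentId);
    }
}

index_document(1, array("search" => 1, "engine" => 1), $keytable, $link);
index_document(2, array("search" => 1, "paging" => 1), $keytable, $link);
```

After the two calls, "search" has a single key id that is linked to both documents, which is exactly what makes the later lookup cheap.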
The following code snippet is the starting point of execution and calls all the functions above. It connects to the database server and selects the database. Initially, the form() function is called so that you can enter the title and abstract of the document.

<?php
// read the form fields from $_POST so the script does not depend on register_globals
if (isset($_POST['submit'])) {
    $title = isset($_POST['title']) ? $_POST['title'] : "";
    $body  = isset($_POST['body']) ? $_POST['body'] : "";

    mysql_connect("localhost", "root", "") or die("Unable to connect to database");
    mysql_select_db("kpp") or die("Unable to select database");

    LoadCurrentWords();

    if ($title and $body) {
        ProcessForm($title, $body);
    }
} else {
    $err = "Please fill in the fields to upload\n";
    form($err);
}
//end of main

function form($errmsg)
{
?>
   <h4 align="center">File Parser &amp; Uploader</h4>
   <b><?php echo $errmsg; ?></b>
   <center>
   <form method="POST" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
   Title:   <input type="text" name="title" ><p>
   Abstract:   <input type="text" name="body" ><p>
     <input type="submit" name="submit" value="Start Parsing and Upload Content">
   </form>

   </center>
<?php
}
?>

Search Engine:

A PHP script makes it possible to query the database through an HTML form. It works like any other search engine: the user enters words in a text box and hits Enter, and the interface presents a result page with links to the documents that contain the words searched for.
In this example, the order in which the results are displayed is determined by the number of search words that appear in each document.
Declare an associative array $CommonWords that contains common words like ‘is’, ‘in’, ‘was’, etc.
First, convert all the search words to lower case.
$search_keywords=strtolower(trim($keywords));

Next, we perform an explode operation on the search words, which stores each search word as an element of an array. The code is shown here.
$arrWords = explode(" ", $search_keywords);

Next, remove duplicate words in $arrWords.
$arrWords = array_unique($arrWords);
In a search operation, we first have to remove common words like ‘is’, ‘in’, ‘was’, and so on. This refines our search criteria. To implement this, we look each word up in the associative array $CommonWords.
Next, separate the common words from the search words. The words to search for are stored in $searchWords and the common words that were removed are stored in $junkWords. Here is the code.

<?php
$searchWords = array();
$junkWords = array();
foreach ($arrWords as $word) {
    // remove common words
    if (!isset($CommonWords[$word])) {
        $searchWords[] = $word;
    } else {
        $junkWords[] = $word;
    }
}
?>
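The query preparation steps above can be collected into one function. The following is a sketch, not the article's code: prepare_query is a hypothetical name, and it uses preg_split() in place of explode() so that repeated spaces in the user's input do not produce empty words.

```php
<?php
// Hypothetical helper: normalise a query and split it into search words and
// ignored common words. preg_split() with PREG_SPLIT_NO_EMPTY avoids the empty
// strings that explode(" ", ...) yields when the user types repeated spaces.
function prepare_query($keywords, $commonWords) {
    $words = preg_split('/\s+/', strtolower(trim($keywords)), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_unique($words);

    $searchWords = array();
    $junkWords = array();
    foreach ($words as $word) {
        if (isset($commonWords[$word])) {
            $junkWords[] = $word;
        } else {
            $searchWords[] = $word;
        }
    }
    return array($searchWords, $junkWords);
}

// Assumed sample input with mixed case, duplicates, and extra spaces.
list($searchWords, $junkWords) = prepare_query(
    "  PHP  is in  php Search ",
    array("is" => 1, "in" => 1)
);
```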
We can display results in two ways.
Type 1: Display the document only if all the search words are present in the document.
Type 2: Display the document if any one of the search words is present.
If you want to perform the Type 1 operation, include the following code snippet in your program.
//count the number of search words and store it in a variable
$noofSearchWords=count($searchWords);
$noofSearchWords stores the number of search words. Later, after looking the search words up in the keyword table, we get a set of matching records, and we can apply a logical AND to them to obtain the desired results. If $noofSearchWords is equal to the number of records, every search word is a known keyword and the next part of the program is executed. Otherwise, “NO SEARCH RESULT FOUND” is displayed.
In the next step, we search the keyword table for the words in the $searchWords array. The following code snippet returns the list of key ids that match the query.

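<?php
// escape each word, then join the words into a single string for the WHERE clause
$arrWords = implode("' OR keyword='", array_map('mysql_escape_string', $searchWords));

// get the key ids from the key table
$query = "select * from keytable where keyword='$arrWords'";

$kResult = mysql_query($query);
?>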
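Because the search words come from user input, another option is to keep them out of the SQL text altogether and use placeholders. The sketch below only builds the query string; it assumes a prepared-statement API such as PDO would bind the words afterwards, which is not what this article's code uses. The name build_keyword_query is hypothetical.

```php
<?php
// Hypothetical helper: build "... keyword IN (?, ?, ...)" with one placeholder
// per search word. The words themselves would be bound separately by a
// prepared-statement API such as PDO (an assumption; the article uses mysql_*).
// Expects at least one search word.
function build_keyword_query($searchWords) {
    $placeholders = implode(', ', array_fill(0, count($searchWords), '?'));
    return "select keyid, keyword from keytable where keyword IN ($placeholders)";
}

$sql = build_keyword_query(array("php", "search"));
```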
As discussed earlier, if you need to perform the Type 1 operation, you have to check whether the number of search words equals the number of records returned by the query. If they are equal, you can proceed to the next step; otherwise, display a message that no search result was found. Here is the code.

<?php
if (mysql_num_rows($kResult) == $noofSearchWords) {
    //search for the keyids in the link table and get the content id
    //Fetch title, first 200 characters of the abstract in to an array
    //Display the result
} else {
    echo "NO SEARCH RESULT FOUND";
}
?>
The following code searches the link table for occurrences of the key ids. It builds an array, $contArray, that maps each content id to the number of search words found in that document.

<?php
// content id => number of search words found in that document
$contArray = array();

while ($kRow = mysql_fetch_array($kResult))
{
    //get the link ids for each key id
    $kid = $kRow['keyid'];
    $query = "SELECT * FROM link WHERE keyid=$kid";
    $lResult = mysql_query($query);
    //echo mysql_num_rows($lResult);
    while ($lRow = mysql_fetch_array($lResult))
    {
        $thisContentId = $lRow["contid"];
        if (!isset($contArray[$thisContentId])) {
            $contArray[$thisContentId] = 1;
        } else {
            $contArray[$thisContentId]++;
        }
    }
}
//end of while
?>
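The counts collected per content id can also be used to apply the Type 1 rule to each document individually. Since the upload step stores each keyword at most once per document, a document contains every search word exactly when its count equals the number of search words. The sketch below illustrates this; keep_full_matches is a hypothetical helper and the counts are sample data.

```php
<?php
// Hypothetical helper: keep only documents whose match count equals the number
// of search words, i.e. documents that contain every search word (Type 1).
function keep_full_matches($contArray, $noofSearchWords) {
    $full = array();
    foreach ($contArray as $contentId => $count) {
        if ($count == $noofSearchWords) {
            $full[$contentId] = $count;
        }
    }
    return $full;
}

// Assumed sample: content id => number of search words found in it.
$sampleCounts = array(7 => 1, 3 => 2, 9 => 2);
$allMatch = keep_full_matches($sampleCounts, 2);
```

With two search words, documents 3 and 9 qualify and document 7, which contains only one of the words, is dropped.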
Sort the array in descending order of its values (the occurrence counts), keeping the content ids as keys. This orders the documents from the highest number of matching search words to the lowest. For example, if the number of search words is four, documents matching 4 words are displayed first, then 3, then 2, and last 1.
//Sort array in descending order of its values
arsort($contArray, SORT_NUMERIC);
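A small sketch with made-up counts shows the effect of arsort() on such an array. The variable names $sampleCounts and $orderedIds are only for illustration.

```php
<?php
// Assumed sample: content id => number of matching search words.
$sampleCounts = array(12 => 1, 5 => 3, 8 => 2);

// arsort() sorts by value, highest first, and keeps the keys (content ids)
// attached to their values.
arsort($sampleCounts);

// The content ids in display order.
$orderedIds = array_keys($sampleCounts);
```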
In the next step, we fetch the title and the first 200 characters of the abstract from the content table into an array, $FoundRef.

<?php
//declare an array to store the results
$FoundRef = array();

foreach ($contArray as $contentId => $occurances) {
    $aQuery = "select contid,title,left(abstract,200) as summary from content where contid = " . $contentId;
    $aResult = mysql_query($aQuery);

    if (mysql_num_rows($aResult) > 0) {
        $aRow = mysql_fetch_array($aResult);
        $FoundRef[] = array(
            "contid" => $aRow["contid"],
            "title" => $aRow["title"],
            "summary" => $aRow["summary"],
            "occurance" => $occurances
        );
    } //end of if
}
?>
Finally we have to display the results in the browser. Here is the code.

<?php
if (isset($FoundRef))
{
    echo "<table width=\"100%\"><tr><th class=\"title\">Search Result</th></tr></table>";
    echo "<a href=\"#\" onclick=\"history.back()\">Back</a>";
    echo "<br />";
    echo sizeof($FoundRef);
    // singular or plural, depending on the number of results
    echo (sizeof($FoundRef) == 1 ? " reference" : " references");
    echo " found";
    echo "<p>";
    if ($junkWords) {
        echo "Common words like";
        foreach ($junkWords as $jWords) {
            echo "&nbsp;" . "'" . $jWords . "'";
        }
        echo " are removed from the search string";
    }
    echo "</p>";
    foreach ($FoundRef as $a => $value)
    {
        echo "<table>";
        echo "<tr><td valign=\"top\">";
        // echo $FoundRef[$a]["contid"];
?>
            <a href="showref.php?refid=<?php echo $FoundRef[$a]["contid"]; ?>"><em><b><?php echo $FoundRef[$a]["title"]; ?></b></em></a><div align="right"> Occurance(s): <?php echo $FoundRef[$a]["occurance"]; ?></div>

            <br /><small><?php echo $FoundRef[$a]["summary"]; ?>...</small><br /><br />
<?php
        echo "</td></tr>";
        echo "</table>";
    }
}
//end of isset FoundRef
?>
The HTML page that gets input from the user is given below.

<html>
<head>
<title>Search Engine</title>
<style type="text/css">

body{    font-size:20;    font-weight:bold; font-stretch:semi-expand; font-family:MSserif;    color:#0066CC;    
background-color:#EEEEE4;
align:center; background-color:white    }
h4{      background-color:#0066CC;     color:#FFFFFF;   font-family:verdana;  }
h3{  color:#0066CC;   }
th{  background-color:#6996ED;  color:#FFFFFF;  font-family:Arial;   }
a{text-decoration:none;}
</style>
</head>
<body>
<?php
if (isset($_POST['submit']))
{
    $keywords = isset($_POST['keywords']) ? $_POST['keywords'] : "";
    if (!$keywords) {
        $errmsg = "Sorry, Please fill in search field";
        form($errmsg);
    } else {
        //Start Timer
        $start = getmicrotime();

        //PERFORM SEARCH OPERATION (the steps described above, which fill $FoundRef)

        if (!empty($FoundRef)) {
            //DISPLAY RESULT (the display code shown above)
        } else {
            //end Timer
            $end = getmicrotime();

            //TOTAL TIME TAKEN TO DO SEARCH OPERATION
            $time_taken = (float)($end - $start);
            $time_taken = number_format($time_taken, 2, '.', '');

            echo "<p>Your Query Executed in $time_taken Seconds";

            // escape the user's input before echoing it back into the page
            $errmsg = "<p>No Search result found for '" . htmlspecialchars($keywords) . "'";
            echo $errmsg;
            echo "<br /><a href=\"#\" onclick=\"history.back()\">Back</a>";
        } //end of isset ref
    } //end of if key word exists
} else {  //display the form
    form("");
} //END OF FORM DISPLAY
?>
</body>
</html>
<?php
function form($errmsg)
{
?>
    <h4 align="center">Search Engine</h4>
        <b><?php echo $errmsg; ?></b>
        <form method="POST" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
        Enter keywords to search on:
        <input type="text" name="keywords" maxlength="100" />
        <input type="submit" name="submit" value="Search" />
        </form>
<?php
}

function getmicrotime()
{
    // microtime() returns the string "usec sec"; add both parts as floats
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
?>
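The getmicrotime() function returns the current Unix timestamp in seconds as a float with microsecond precision. It is called at the start and at the end of the search process, and the difference between the two values is the time taken.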
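On PHP 5 and later, the same measurement can be made without a helper, because microtime(true) returns the float directly. The sketch below uses usleep() as a stand-in for the search work, which is an assumption for the sake of the example.

```php
<?php
// Sketch for PHP 5 and later: microtime(true) returns the timestamp as a float.
$start = microtime(true);
usleep(1000); // stand-in for the search work (assumed)
$end = microtime(true);

// Format the elapsed seconds with two decimals, as the article's code does.
$time_taken = number_format($end - $start, 2, '.', '');
```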

Conclusion:

In this part 1, the search engine searches for the occurrence of words in the document. In part 2, the program is slightly modified so that, when we upload a document, the number of occurrences of each word is stored in the link table. The search engine then ranks documents by the number of occurrences of each search word. For example, if the word ‘paging’ occurred 11 times and ‘programs’ occurred 12 times, then the rank for the document is 11 + 12 = 23.
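The ranking idea can be sketched in a few lines. This is only an illustration of the arithmetic, not the part 2 implementation: document_rank is a hypothetical helper, and the per-document occurrence map is an assumed data shape.

```php
<?php
// Hypothetical helper: rank a document by summing how often each search word
// occurs in it. $occurrences maps word => count for one document (assumed shape).
function document_rank($occurrences, $searchWords) {
    $rank = 0;
    foreach ($searchWords as $word) {
        if (isset($occurrences[$word])) {
            $rank += $occurrences[$word];
        }
    }
    return $rank;
}

$rank = document_rank(
    array('paging' => 11, 'programs' => 12),
    array('paging', 'programs')
);
```

Search words that do not occur in the document simply add nothing to its rank.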

Source Code: