Introduction:
In Part 1, the article discussed document-based searches that rank results by the number of search
words found in each document. This article extends that approach: results are ranked by the number of search
words found plus the number of occurrences of each search word in the document.
Suppose the search is for php tutorials and examples. The following table shows, for each article, how often
each search word occurs in it. Common words such as is, was and and are removed from the search string by the
program, so this example has three search words: php, tutorials and examples.
| No. | Article Number | php | tutorials | examples | Total Occurrences | Rank |
| 1. | Article #189 | 15 | 11 | 16 | 42 | 3 |
| 2. | Article #203 | 25 | 12 | 8 | 45 | 1 |
| 3. | Article #257 | 18 | 16 | 5 | 39 | 4 |
| 4. | Article #145 | 6 | 8 | 17 | 31 | 5 |
| 5. | Article #526 | 5 | 17 | 21 | 43 | 2 |
| 6. | Article #86 | 14 | 4 | 10 | 28 | 6 |
Article #203 has the highest total number of occurrences (45), so it is given rank 1. The other articles are
ranked in the same way, in descending order of total occurrences.
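The ranking in the table can be reproduced with a few lines of PHP. This is a minimal sketch that only uses the numbers from the table; the variable names $counts, $totals and $ranks are illustrative and are not part of the search engine itself.

```php
<?php
// Occurrence counts per search word, copied from the table above.
$counts = array(
    189 => array( "php" => 15, "tutorials" => 11, "examples" => 16 ),
    203 => array( "php" => 25, "tutorials" => 12, "examples" => 8 ),
    257 => array( "php" => 18, "tutorials" => 16, "examples" => 5 ),
    145 => array( "php" => 6,  "tutorials" => 8,  "examples" => 17 ),
    526 => array( "php" => 5,  "tutorials" => 17, "examples" => 21 ),
    86  => array( "php" => 14, "tutorials" => 4,  "examples" => 10 ),
);

// Total occurrences per article.
$totals = array();
foreach ( $counts as $article => $perWord ) {
    $totals[$article] = array_sum( $perWord );
}

// Highest total first; arsort() keeps the article numbers as keys.
arsort( $totals );

// Rank 1 goes to the first entry, rank 2 to the second, and so on.
$ranks = array();
$rank  = 1;
foreach ( $totals as $article => $total ) {
    $ranks[$article] = $rank++;
}

foreach ( $ranks as $article => $r ) {
    echo "Article #$article: total {$totals[$article]}, rank $r\n";
}
```

Because every total in this example is different, the order is unambiguous. How ties should be ranked is not covered by the table.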
Building The Database:
The database consists of three tables: the Content Table, the Keyword Table and the Link Table. The Content
Table holds each article's title and abstract. The Keyword Table holds the keywords, and its keyword field is
indexed. The Link Table holds the keyword id, the content id and the number of occurrences.
The SQL statements for creating these three tables are shown below.
Content Table:
CREATE TABLE content (
    contid mediumint NOT NULL auto_increment,
    title text NOT NULL,
    abstract longtext NOT NULL,
    PRIMARY KEY (contid)
) ENGINE=MyISAM;
Keyword Table:
CREATE TABLE keytable (
    keyid mediumint NOT NULL auto_increment,
    keyword varchar(100) NOT NULL,
    PRIMARY KEY (keyid),
    KEY keyword (keyword)
) ENGINE=MyISAM;
Link Table:
CREATE TABLE link (
    keyid mediumint NOT NULL,
    contid mediumint NOT NULL,
    occurances mediumint NOT NULL
) ENGINE=MyISAM;
Preparing Database:
The upload engine parses the abstract and processes every word in it. It removes common
words like is, was, and, that.
In Part 1, duplicate words were removed. Here every duplicate
word is counted as an occurrence. The $wordMap array is an associative array
that holds each word and its number of occurrences.
Next, the keyword table is searched for every word in the $wordMap array. If a
match is found, the existing key id is used; otherwise the new keyword is inserted in the keyword
table and its newly generated key id is used. In both cases the link table is updated with the
key id, the content id and the number of occurrences.
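The lookup-or-insert step can be tried without a database. The sketch below uses two plain PHP arrays as stand-ins for the keyword and link tables; the helper name indexDocument() and the sample word maps are assumptions made for illustration only.

```php
<?php
// In-memory stand-ins for the keyword and link tables (illustration only).
$keyTable  = array();   // keyword => key id
$linkTable = array();   // rows of ( keyid, contid, occurances )

// Hypothetical helper: record one document's word map in both "tables".
function indexDocument( $contentId, $wordMap, &$keyTable, &$linkTable ) {
    foreach ( $wordMap as $word => $occurrences ) {
        if ( !isset( $keyTable[$word] ) ) {
            // New keyword: assign the next key id, as auto_increment would.
            $keyTable[$word] = count( $keyTable ) + 1;
        }
        // One link row per keyword and document, with the occurrence count.
        $linkTable[] = array(
            "keyid"      => $keyTable[$word],
            "contid"     => $contentId,
            "occurances" => $occurrences,
        );
    }
}

indexDocument( 1, array( "php" => 3, "tutorials" => 1 ), $keyTable, $linkTable );
indexDocument( 2, array( "php" => 2, "examples" => 4 ), $keyTable, $linkTable );

print_r( $keyTable );
echo count( $linkTable ), " link rows\n";
```

The keyword php is stored once in the keyword table but appears in two link rows, one per document, each with its own occurrence count.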
FormWordList() Function:
This is the core part of the program. This function is called after the ExtractWords() function. It parses
the filtered words and removes common words like a, is, was, and. All other words are taken as valid words.
It returns an associative array, $wordMap, which stores each valid word and the number of times it occurs
in the document.
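<?php
function FormWordList( $wordList ) {
    global $COMMON_WORDS;
    global $MAX_WORD_LENGTH;
    $wordMap = array();
    foreach ( $wordList as $word ) {
        $len = strlen( $word );
        // skip single characters and overly long words
        if ( ($len > 1) && ($len < $MAX_WORD_LENGTH) ) {
            // empty() avoids an undefined-index notice for words not in the list
            if ( empty( $COMMON_WORDS[$word] ) ) {
                if ( !isset( $wordMap[$word] ) ) {
                    $wordMap[$word] = 1;    // first occurrence
                } else {
                    $wordMap[$word]++;      // repeated occurrence
                }
            }
        }
    }
    return $wordMap;
}
?>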
Every word in $wordList is first checked for length and then checked to see if it is a common word.
If it is, the loop continues with the next word. Otherwise, the function checks whether the word already
exists in the $wordMap associative array. If it does not, the word is added to
$wordMap with an occurrence count of 1. Otherwise, its occurrence count is
incremented by 1.
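To see the counting rule at work on a sample sentence, here is a small self-contained sketch. It assumes that $COMMON_WORDS is keyed by the word itself, as the lookups in FormWordList() suggest; the list contents, the value of $MAX_WORD_LENGTH and the helpers splitWords() and countWords() are made up for this example. splitWords() only stands in for ExtractWords(), whose code is not shown in this part.

```php
<?php
// Assumed configuration; the real values are not shown in this article.
$COMMON_WORDS    = array( "is" => 1, "was" => 1, "and" => 1, "a" => 1, "that" => 1 );
$MAX_WORD_LENGTH = 50;

// Hypothetical tokenizer standing in for ExtractWords().
function splitWords( $text ) {
    return preg_split( '/[^a-z0-9]+/', strtolower( $text ), -1, PREG_SPLIT_NO_EMPTY );
}

// Same counting rule as FormWordList(), written with array_count_values().
function countWords( $wordList, $commonWords, $maxLength ) {
    $kept = array();
    foreach ( $wordList as $word ) {
        $len = strlen( $word );
        if ( $len > 1 && $len < $maxLength && !isset( $commonWords[$word] ) ) {
            $kept[] = $word;
        }
    }
    // Turns the list of kept words into word => number of occurrences.
    return array_count_values( $kept );
}

$words   = splitWords( "PHP is fun, and PHP tutorials that teach PHP" );
$wordMap = countWords( $words, $COMMON_WORDS, $MAX_WORD_LENGTH );
print_r( $wordMap );
```

The words is, and and that are dropped, and php ends up with a count of 3 because every repetition is counted.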
ProcessForm() Function:
The code is similar to the Part 1 code, except that here the occurrence count is added to the link table
along with the key id and content id. Here is the code.
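<?php
// $wordList maps each word to its occurrence count;
// $allWords maps every known keyword to its key id
foreach ( $wordList as $word => $occurances ) {
    $keyId = "";
    if ( !isset( $allWords[$word] ) ) {
        // new keyword: insert it and remember the generated key id
        mysql_query( sprintf( "INSERT INTO keytable ( keyword ) VALUES ( '%s' )",
            mysql_real_escape_string( $word ) ) );
        $keyId = mysql_insert_id();
        $allWords[$word] = $keyId;
    }
    else {
        $keyId = $allWords[$word];
    }
    // insert the link; the column is spelled "occurances" in the link table
    mysql_query( sprintf( "INSERT INTO link (keyid, contid, occurances)
        VALUES ( %d, %d, %d)",
        $keyId, $contentId, $occurances ) );
}
?>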
Search Engine:
As discussed in the Introduction, the search here takes into account the number of occurrences
of the search words in each document. Here is the code.
<?php
while ( $lRow = mysql_fetch_array( $lResult ) ) {
    $thisContentId = $lRow["contid"];
    if ( !isset( $contArray[$thisContentId] ) ) {
        // first search word found in this document
        $contArray[$thisContentId]["oc"] = $lRow["occurances"];
        $contArray[$thisContentId]["id"] = $lRow["contid"];
        $contArray[$thisContentId]["wrank"] = 1;
    } else {
        // another search word found in the same document
        $contArray[$thisContentId]["oc"] += $lRow["occurances"];
        $contArray[$thisContentId]["wrank"]++;
    }
}
?>
For every record in the result set from the link table, the content id and number of occurrences are stored
in an associative array, $contArray. During the while loop, if the content
id already exists in $contArray, the new occurrence value is added to its running total
and its wrank counter, which counts how many search words matched the document, is incremented.
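The accumulation can be followed with a handful of made-up rows. In this sketch the $rows array stands in for the records that mysql_fetch_array() would return; the content ids and counts are invented for illustration.

```php
<?php
// Sample rows standing in for the link-table result set (made-up values).
$rows = array(
    array( "contid" => 7, "occurances" => 5 ),  // document 7, first search word
    array( "contid" => 9, "occurances" => 2 ),  // document 9, first search word
    array( "contid" => 7, "occurances" => 3 ),  // document 7, second search word
);

$contArray = array();
foreach ( $rows as $lRow ) {
    $id = $lRow["contid"];
    if ( !isset( $contArray[$id] ) ) {
        // First time this document is seen: start both counters at zero.
        $contArray[$id] = array( "oc" => 0, "id" => $id, "wrank" => 0 );
    }
    $contArray[$id]["oc"]    += $lRow["occurances"];  // total occurrences
    $contArray[$id]["wrank"] += 1;                    // search words matched
}

print_r( $contArray );
```

Document 7 ends up with 8 occurrences and a wrank of 2, while document 9 has 2 occurrences and a wrank of 1.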
If $contArray is set, some results were found in the database table.
Otherwise, the program skips to the next part, which displays the message NO RESULTS FOUND.
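<?php
if ( isset( $contArray ) ) {
    // declare an array to store the results
    $FoundRef = array();
    // sort in descending order of total occurrences ("oc"), keeping the content ids as keys
    uasort( $contArray, function ( $a, $b ) {
        if ( $a["oc"] == $b["oc"] ) {
            return 0;
        }
        return ( $a["oc"] > $b["oc"] ) ? -1 : 1;
    } );
    // store the results in the $FoundRef array
    // (the code for this is given in the next listing)
}
?>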
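The sort order can be checked on a small sample. This sketch sorts a hand-made $contArray by its "oc" value with uasort(); the comparator name compareByOccurrences() and the sample values are assumptions for illustration.

```php
<?php
// Made-up accumulated results, keyed by content id.
$contArray = array(
    7  => array( "oc" => 8,  "id" => 7,  "wrank" => 2 ),
    9  => array( "oc" => 2,  "id" => 9,  "wrank" => 1 ),
    12 => array( "oc" => 11, "id" => 12, "wrank" => 2 ),
);

// Hypothetical comparator: a larger "oc" value sorts first.
function compareByOccurrences( $a, $b ) {
    if ( $a["oc"] == $b["oc"] ) {
        return 0;
    }
    return ( $a["oc"] > $b["oc"] ) ? -1 : 1;
}

// uasort() reorders the entries but keeps the content ids as keys.
uasort( $contArray, 'compareByOccurrences' );

echo implode( ", ", array_keys( $contArray ) ), "\n";
```

Comparing on "oc" explicitly makes the ranking criterion visible in the code instead of relying on how PHP compares whole arrays.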
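In the next step we fetch the title and the first 200 characters of the abstract from the content table
into the array $FoundRef. Only documents that contain every search word, that is, documents whose wrank
equals $noofSearchWords, are included.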
<?php
foreach ( $contArray as $cont ) {
    $rank = $cont["wrank"];
    // keep only documents that contain every search word
    if ( $rank == $noofSearchWords ) {
        $contentId = (int) $cont["id"];
        $occurances = $cont["oc"];
        $aQuery = "select contid,title,left(abstract,200) as summary from content where contid = " . $contentId;
        $aResult = mysql_query( $aQuery );
        if ( mysql_num_rows( $aResult ) > 0 ) {
            $aRow = mysql_fetch_array( $aResult );
            $FoundRef[] = array (
                "contid" => $aRow["contid"],
                "title" => $aRow["title"],
                "summary" => $aRow["summary"],
                "occurance" => $occurances );
        } // end of if rows found
    } // end of if rank
} // end of for each
?>
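The wrank test is what restricts the results to documents containing all of the search words. A minimal sketch of just that filter, with made-up values and without the database lookup, might look like this; $matchingIds is an illustrative name.

```php
<?php
$noofSearchWords = 2;   // e.g. a query with two search words

// Made-up results, already sorted by total occurrences.
$contArray = array(
    12 => array( "oc" => 11, "id" => 12, "wrank" => 2 ),
    7  => array( "oc" => 8,  "id" => 7,  "wrank" => 2 ),
    9  => array( "oc" => 2,  "id" => 9,  "wrank" => 1 ),
);

$matchingIds = array();
foreach ( $contArray as $cont ) {
    // Keep only documents in which every search word was found.
    if ( $cont["wrank"] == $noofSearchWords ) {
        $matchingIds[] = $cont["id"];
    }
}

echo implode( ", ", $matchingIds ), "\n";
```

Document 9 matched only one of the two search words, so it is left out even though it has occurrences of that word.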
Finally we have to display the results in the browser. Here is the code.
<?php
if ( isset( $FoundRef ) ) {
    echo "<table width=\"100%\"><tr><th class=\"title\">Search Result</th></tr></table>";
    echo "<br />";
    echo "<h5>";
    echo sizeof( $FoundRef );
    echo ( sizeof( $FoundRef ) == 1 ? " reference" : " references" );
    echo " found. ";
    // $junkWords holds the common words that were dropped from the search string
    if ( !empty( $junkWords ) ) {
        echo "Common words like";
        foreach ( $junkWords as $jWords ) {
            echo " '" . htmlspecialchars( $jWords ) . "'";
        }
        echo " are removed from the search string";
    }
    echo "</h5>";
    echo "<table>";
    foreach ( $FoundRef as $a => $value ) {
        // leave PHP mode to print one table row per result
        ?>
        <tr><td valign="top">
        <a href="showref.php?refid=<?php echo (int) $FoundRef[$a]["contid"]; ?>"><em><b>
        <?php echo htmlspecialchars( $FoundRef[$a]["title"] ); ?></b></em></a><div align="right">
        Occurrence(s):
        <?php echo (int) $FoundRef[$a]["occurance"]; ?></div>
        <br /><small>
        <?php echo htmlspecialchars( $FoundRef[$a]["summary"] ); ?>...</small>
        <br /><br />
        </td></tr>
        <?php
    }
    echo "</table>";
} // end of isset FoundRef
?>
Timer to Calculate the Time Taken to Search the Documents:
You can include a timer that measures how long the search operation takes. PHP's microtime() function
returns the current time with microsecond resolution; reading it before and after the search and
subtracting the two values gives the elapsed time.
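A minimal sketch of such a timer, assuming PHP 5 or later, where microtime( true ) returns the current Unix time as a float in seconds. The helper name getMicroTime() and the usleep() call that stands in for the search are illustrative assumptions.

```php
<?php
// Hypothetical helper: current time in seconds as a float,
// with microsecond resolution.
function getMicroTime() {
    return microtime( true );
}

$start = getMicroTime();

// ... run the search here; usleep() stands in for the real work ...
usleep( 20000 );   // pause for 20 milliseconds

$elapsed = getMicroTime() - $start;
printf( "Search took %.4f seconds\n", $elapsed );
```

Taking the first reading immediately before the search query and the second one immediately after the results are collected keeps page rendering out of the measurement.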
Thus we come to the end of this document-based search, which ranks results by the number of search words
found plus the number of occurrences of each search word in each document.
I implemented this technique after several optimizations to reduce the search time. I also tested
it on over 60000 distinct documents. Initially the search time was around 23.35 seconds;
with successive optimizations it dropped to 10.89 seconds, then 3.56 seconds and
finally 0.71 seconds. Note that the search time also varies with the hardware setup. I welcome
comments on this article to optimize the performance further.