So you've got a dynamic site, filled with all sorts of user input, whether
it be a 'phorum' or something like my own site at knowpost.com.
ht://Dig (htdig.org) will take care of indexing and searching your HTML
pages, but if you are like me, you have very few HTML pages, and most of your
"content" resides in BLOBs in your database. You can't do anything useful
with a LIKE '%searchword%' query: it matches any row that contains the string
anywhere, with no ranking at all, so what comes back just isn't relevant.
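To see why a plain substring match falls short, here is a minimal sketch using Python's built-in sqlite3 module. The posts table, its rows and the like_search helper are all made up for illustration; your own schema and database will differ.

```python
import sqlite3

# Throwaway in-memory database with a hypothetical "posts" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (body) VALUES (?)",
    [
        ("How do I start a party?",),
        ("Modern art galleries worth a visit",),
        ("Replacing a car part",),
    ],
)

def like_search(word):
    """Return every body containing `word` as a substring, in id order."""
    pattern = f"%{word}%"
    rows = conn.execute(
        "SELECT id, body FROM posts WHERE body LIKE ? ORDER BY id",
        (pattern,),
    )
    return [body for _id, body in rows]

# Searching for "art" also hits "start", "party" and "part",
# and nothing tells you which row is the most relevant.
for body in like_search("art"):
    print(body)
```

Only one of those rows is actually about art, and the query gives you no way to tell which.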
There has to be a better way, and indeed there is, with a few easy steps.
Here's how to slap a better search together:
Part one: BNR--Blob Noise Reduction
The first problem with your content is that it is filled with clunky
"noisewords" like "a", "the", "where" and "look": things that are there to
help us humans communicate, but that really don't have anything to do with
relevance.
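As a rough illustration of the idea, here is a minimal Python sketch that drops noisewords from a chunk of text. The four-word NOISEWORDS set and the strip_noise function are my own stand-ins for this example; a real noiseword list would be far longer, and the tokenizing rule (lowercase runs of letters, digits and apostrophes) is just one reasonable assumption.

```python
import re

# Hypothetical, tiny noiseword set for illustration only.
NOISEWORDS = {"a", "the", "where", "look"}

def strip_noise(text, noisewords=NOISEWORDS):
    """Return the lowercased words of `text` that are not noisewords."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in noisewords]

print(strip_noise("Look where the cat sat on a mat"))
# → ['cat', 'sat', 'on', 'mat']
```

Note that "on" survives here only because it isn't in the toy set; what counts as noise depends entirely on the list you feed in.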
We gotta get rid of those. I've included a big list of noisewords
(noisewords.txt) for you to use, modify or mutilate. Essentially, what we're
trying to do here is get all those noisewords out of your data, and build a
table with two columns: the word, and its indicator (the content associated
with it). We want something that will eventually look like this: