Re: Index & search questions; special cases

Walter Underwood Mon, 13 Nov 2006 10:19:36 -0800

On 11/12/06 8:52 PM, "Michael Imbeault" <[EMAIL PROTECTED]>
wrote:


> Sadly I can't rely on users smartness for this :) I have concerns that
> for stuff like Hepatitis A, it will match just about every document
> containing hepatitis and the very common 'a' word, anywhere in the
> document. I can't stopword single letters, cause then there would be no
> way to find documents about 'hepatitis c' and not about 'hepatitis b'
> for example. I will test my solution and report; if you have any other
> ideas, just tell me.

Nutch has phrase pre-filtering which helps with this. It indexes the
phrase fragments as separate terms and uses that set of matches to
filter the set of matching documents.
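
A rough sketch of the idea in Python (not Nutch's actual code): each
adjacent word pair is indexed as an extra term, so a query for
"hepatitis a" is looked up as the single term "hepatitis_a" instead of
intersecting against the huge posting list for "a". The function names
and the underscore-joined term format are assumptions for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def index_terms(text):
    """Return single words plus adjacent word pairs joined with '_'."""
    words = tokenize(text)
    pairs = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + pairs


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Documents that contain every adjacent word pair of the phrase."""
    words = tokenize(phrase)
    if len(words) < 2:
        return set(index.get(words[0], set())) if words else set()
    pairs = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result


docs = {
    1: "a study of hepatitis a in children",
    2: "hepatitis c is a viral infection",
    3: "a vaccine for hepatitis b",
}
index = build_index(docs)
candidates = phrase_candidates(index, "hepatitis a")
```

For phrases longer than two words the result is only a candidate set,
since the pairs could occur in different places; a positional phrase
check would still run on those candidates.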

Another approach is to implement protected phrases, similar to the
protected words in stemming. These would be protected from stopword
processing.
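
A minimal sketch of protected phrases, in Python rather than as a real
Lucene analyzer. The stopword list, the phrase list, and the analyze()
function are made up for illustration, and it only handles two-word
phrases. Protected phrases are emitted as single terms before stopword
removal gets a chance to drop the "a".

```python
# Hypothetical lists; a real deployment would load these from config.
STOPWORDS = {"a", "an", "the", "of", "and", "is", "in"}
PROTECTED_PHRASES = {("hepatitis", "a"), ("vitamin", "a")}


def analyze(text):
    """Tokenize, keep protected phrases as single terms, drop stopwords."""
    words = text.lower().split()
    terms = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PROTECTED_PHRASES:
            # Emit the whole phrase as one term and skip both words.
            terms.append(" ".join(pair))
            i += 2
        elif words[i] in STOPWORDS:
            i += 1
        else:
            terms.append(words[i])
            i += 1
    return terms


terms = analyze("A case of hepatitis A in the clinic")
```

The same analysis has to run on the query side, so that a search for
"hepatitis a" produces the same single term that was indexed.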

A list of exception words and phrases is a pretty common trick in
other engines. Otherwise, you go nuts trying to get your analyzer
to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi
did this.

wunder
-- 
Walter Underwood
Search Guru, Netflix
