On 11/12/06 8:52 PM, "Michael Imbeault" <[EMAIL PROTECTED]> wrote:
> Sadly I can't rely on users smartness for this :) I have concerns that
> for stuff like Hepatitis A, it will match just about every document
> containing hepatitis and the very common 'a' word, anywhere in the
> document. I can't stopword single letters, cause then there would be no
> way to find documents about 'hepatitis c' and not about 'hepatitis b'
> for example. I will test my solution and report; if you have any other
> ideas, just tell me.

Nutch has phrase pre-filtering, which helps with this. It indexes the
phrase fragments as separate terms and uses that set of matches to
filter the set of matching documents.

Another approach is to implement protected phrases, similar to the
protected words in stemming. These would be protected from stopword
processing.

A list of exception words and phrases is a pretty common trick in
other engines. Otherwise, you go nuts trying to get your analyzer
to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi
did this.

wunder
--
Walter Underwood
Search Guru, Netflix
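
To make the protected-phrase idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how Nutch, AltaVista, or Inktomi implemented it: the stopword list, the phrase list, and the `analyze` function are all hypothetical. The idea is to look for a protected phrase before stopword removal and emit it as a single term, so the "a" in "hepatitis a" survives while an ordinary "a" is still dropped.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "and", "of", "in", "is"}

# Hypothetical exception list: phrases that must survive stopword removal.
PROTECTED_PHRASES = {("hepatitis", "a"), ("vitamin", "a")}
MAX_PHRASE_LEN = max(len(p) for p in PROTECTED_PHRASES)


def analyze(text):
    """Tokenize text, emitting protected phrases as single terms
    and dropping stopwords everywhere else."""
    words = re.findall(r"\w+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest protected phrase starting at position i first.
        for n in range(MAX_PHRASE_LEN, 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in PROTECTED_PHRASES:
                tokens.append("_".join(candidate))
                i += n
                break
        else:
            # No protected phrase starts here: apply normal stopword handling.
            if words[i] not in STOPWORDS:
                tokens.append(words[i])
            i += 1
    return tokens


print(analyze("Hepatitis A is a viral infection of the liver"))
print(analyze("A study of hepatitis"))
```

The same function would have to run at both index time and query time, so that a query for "hepatitis a" becomes the single term `hepatitis_a` and no longer matches documents that merely contain "hepatitis" and a stray "a".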