Re: Index & search questions; special cases

2006-11-19 Thread Chris Hostetter
: Chris, thanks for the tips (or should I say, detailed explanation!). I
: actually got it working! It was a pain at first (never did any java, and

good to know .. glad it worked out for you.

: If Solr is interested in the filter, just tell me (and how should I do
: to contribute it). The full

Re: Index & search questions; special cases

2006-11-18 Thread Michael Imbeault
CommonGrams itself seems to have some other dependencies on nutch because of other utilities in the same class, but based on a quick skim, what you really want is the nested "private static class Filter extends TokenFilter" which doesn't really have any external dependencies. If you extract that

Re: Index & search questions; special cases

2006-11-15 Thread Chris Hostetter
: > Yeah, the Nutch code is highly intertwined with its unique configuration
: > infrastructure and makes it hard to pull pieces of it out like this.

that CommonGrams inner Filter class seemed like it could be extracted easily enough.

: This is a critique that has been heard a lot (mainly becaus

Re: Index & search questions; special cases

2006-11-15 Thread Sami Siren
Erik Hatcher wrote: Yeah, the Nutch code is highly intertwined with its unique configuration infrastructure and makes it hard to pull pieces of it out like this. This is a critique that has been heard a lot (mainly because it's true :) It would be really cool if different camps of Lucene could

Re: Index & search questions; special cases

2006-11-14 Thread Erik Hatcher
On Nov 14, 2006, at 2:00 PM, Chris Hostetter wrote: CommonGrams itself seems to have some other dependencies on nutch because of other utilities in the same class, but based on a quick skim, what you really want is the nested "private static class Filter extends TokenFilter" which doesn't re

Re: Index & search questions; special cases

2006-11-14 Thread Chris Hostetter
: > : Nutch has phrase pre-filtering which helps with this. It indexes the
: > : phrase fragments as separate terms and uses that set of matches to
: > : filter the set of matching documents.
: > That reminds me ... i seem to remember someone saying once that Nutch also
: > builds word based n-gra

Re: Re: Index & search questions; special cases

2006-11-13 Thread Mike Klaas
On 11/13/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:
> Hello everyone, Thanks for all your answers; synonym-based approaches won't work because the medical / research field is evolving way too fast; it would

Another approach is to extract the term explicitly. An easy-to-implement approach

Re: Index & search questions; special cases

2006-11-13 Thread Michael Imbeault
Hello everyone, Thanks for all your answers; synonym-based approaches won't work because the medical / research field is evolving way too fast; it would become unmaintainable very quickly, and the list would be huge. Anyway, I can't rely on score because I'm sorting by date, so I need to eli

Re: Index & search questions; special cases

2006-11-13 Thread Otis Gospodnetic
Indeed. CommonGrams.java in Nutch is the place to look.

Otis

- Original Message
From: Erik Hatcher <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 13, 2006 2:08:51 PM
Subject: Re: Index & search questions; special cases

On Nov 13, 2006, at 1:51

Re: Index & search questions; special cases

2006-11-13 Thread Erik Hatcher
On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote: That reminds me ... i seem to remember someone saying once that Nutch also builds word based n-grams out of its stop words, so searches on "the" or "on" won't match anything because those words are never indexed as single tokens, but if a
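
To make the idea above concrete, here is a rough Python sketch of word-based n-grams built around stop words. It is only an illustration of the behaviour described in this message, not the Nutch CommonGrams code; the function name, the underscore joiner and the stop word set are all made up for the example.

```python
def common_grams(tokens, stopwords):
    """Illustrative only (not Nutch's implementation).

    Non-stop words are emitted as-is. A stop word is never emitted on
    its own; it only appears joined to a neighbouring token.
    """
    out = []
    for i, tok in enumerate(tokens):
        if tok not in stopwords:
            out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Emit a bigram whenever either side of the pair is a stop word.
            if tok in stopwords or nxt in stopwords:
                out.append(f"{tok}_{nxt}")
    return out


grams = common_grams("hepatitis a virus".split(), {"a", "the", "on"})
print(grams)
```

With this scheme a search for the bare word "a" has nothing to match, while the pair "hepatitis a" is still findable through its joined token.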

Re: Index & search questions; special cases

2006-11-13 Thread Yonik Seeley
On 11/13/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> The SynonymFilter could have the following config:
> hepatitis a, hepatitis_a

Oops, the synonyms should be reversed like so:
hepatitis_a, hepatitis a
so that when expand="false" for querying, hepatitis a is mapped to hepatitis_a

-Yonik
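
A small Python sketch of what the corrected rule is meant to achieve at query time: with expand="false", every entry in the comma-separated list is reduced to the first one. This is a toy model and not Solr's SynonymFilter; the helper names parse_rule and reduce_synonyms are hypothetical, and tokens are assumed to be lowercased already.

```python
def parse_rule(line):
    """Parse one comma-separated synonym line; the first entry is the target."""
    entries = [tuple(e.split()) for e in line.split(",") if e.strip()]
    target = entries[0]
    return {src: target for src in entries}


def reduce_synonyms(tokens, rules):
    """Replace any matching token sequence with its target (longest match first)."""
    out = []
    i = 0
    longest = max(len(src) for src in rules)
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            # No rule starts at this position; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


rules = parse_rule("hepatitis_a, hepatitis a")
print(reduce_synonyms("chronic hepatitis a infection".split(), rules))
```

The two-token sequence becomes a single token, so the stray "a" can no longer match on its own elsewhere in a document.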

Re: Index & search questions; special cases

2006-11-13 Thread Yonik Seeley
On 11/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: - Somewhat related : Let's say I index "Polymyxin B". If I stopword single letters, would a phrase search ("Polymyxin B") still find the right documents (I don't think so, but still)? If not, I'll have to index single letters; how do I prev

Re: Index & search questions; special cases

2006-11-13 Thread Chris Hostetter
: > Sadly I can't rely on users smartness for this :) I have concerns that
: > for stuff like Hepatitis A, it will match just about every document
: > containing hepatitis and the very common 'a' word, anywhere in the
: > document. I can't stopword single letters, cause then there would be no
: >

Re: Index & search questions; special cases

2006-11-13 Thread Yonik Seeley
On 11/13/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
> Another approach is to implement protected phrases, similar to the protected words in stemming. These would be protected from stopword processing.

One could use the synonym filter (which can handle multi-token synonyms) to get this effect

Re: Index & search questions; special cases

2006-11-13 Thread Walter Underwood
On 11/12/06 8:52 PM, "Michael Imbeault" <[EMAIL PROTECTED]> wrote:
> Sadly I can't rely on users smartness for this :) I have concerns that
> for stuff like Hepatitis A, it will match just about every document
> containing hepatitis and the very common 'a' word, anywhere in the
> document. I can't

Re: Index & search questions; special cases

2006-11-12 Thread Michael Imbeault
Chris Hostetter wrote: A couple of things make your question really hard to answer ... first off, you can specify different analyser chains for index time and query time -- when dealing with the WordDelim filter (or the synonym filter) this is frequently necessary -- so the answers to your questi

Re: Index & search questions; special cases

2006-11-12 Thread Chris Hostetter
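: - Let's say I index "HIV-1" with <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which
: after parsing by the above filter would yield HIV1 or HIV 1) also find
: documents which have HIV and the number "1" somewhere in the document,
: but not directly after HIV? If so, how should I fix this? I cou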

Index & search questions; special cases

2006-11-12 Thread Michael Imbeault
Hello again, - Let's say I index "HIV-1" with <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which after parsing by the above filter would yield HIV1 or H
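
For readers unfamiliar with the filter in question, a very rough Python approximation of what the message expects "HIV-1" to turn into (the parts HIV and 1, plus the catenated HIV1). This is an assumption-laden sketch, not Solr's WordDelimiterFilter: it only splits on letter/digit runs, ignores case-change splitting and token positions, and the function name and catenate_all flag are invented for the example.

```python
import re


def word_delimiter(token, catenate_all=True):
    """Split a token into runs of letters and runs of digits.

    Approximation only: the real filter has many more options and also
    tracks token positions, which matter for phrase queries.
    """
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    out = list(parts)
    if catenate_all and len(parts) > 1:
        # Also emit all parts joined together, e.g. "HIV1".
        out.append("".join(parts))
    return out


print(word_delimiter("HIV-1"))
```

Whether HIV and 1 must be adjacent to match then depends on how the query side is analysed and whether it ends up as a phrase query, which is what the question above is asking.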