: Chris, thanks for the tips (or should I say, detailed explanation!). I
: actually got it working! It was a pain at first (never did any java, and
good to know .. glad it worked out for you.
: If Solr is interested in the filter, just tell me (and how should I do
: to contribute it).
The full
CommonGrams itself seems to have some other dependencies on nutch because
of other utilities in the same class, but based on a quick skim, what you
really want is the nested "private static class Filter extends
TokenFilter" which doesn't really have any external dependencies. If you
extract that
: > Yeah, the Nutch code is highly intertwined with its unique configuration
: > infrastructure and makes it hard to pull pieces of it out like this.
that CommonGrams inner Filter class seemed like it could be extracted
easily enough.
: This is a critique that has been heard a lot (mainly becaus
Erik Hatcher wrote:
Yeah, the Nutch code is highly intertwined with its unique configuration
infrastructure and makes it hard to pull pieces of it out like this.
This is a critique that has been heard a lot (mainly because it's true :)
It would be really cool if different camps of lucene could
On Nov 14, 2006, at 2:00 PM, Chris Hostetter wrote:
CommonGrams itself seems to have some other dependencies on nutch
because
of other utilities in the same class, but based on a quick skim,
what you
really want is the nested "private static class Filter extends
TokenFilter" which doesn't re
: > : Nutch has phrase pre-filtering which helps with this. It indexes the
: > : phrase fragments as separate terms and uses that set of matches to
: > : filter the set of matching documents.
: > That reminds me ... i seem to remember someone saying once that Nutch also
: > builds word based n-gra
On 11/13/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:
Hello everyone,
Thanks for all your answers; synonyms based approaches won't work
because the medical / research field is evolving way too fast; it would
Another approach is to extract the term explicitly. An
easy-to-implement approach
Hello everyone,
Thanks for all your answers; synonyms based approaches won't work
because the medical / research field is evolving way too fast; it would
become unmaintainable very quickly, and the list would be huge. Anyway,
I can't rely on score because I'm sorting by date, so I need to
eli
Indeed. CommonGrams.java in Nutch is the place to look.
Otis
----- Original Message -----
From: Erik Hatcher <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 13, 2006 2:08:51 PM
Subject: Re: Index & search questions; special cases
On Nov 13, 2006, at 1:51
On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote:
That reminds me ... i seem to remember someone saying once that Nutch also
builds word based n-grams out of its stop words, so searches on "the"
or "on" won't match anything because those words are never indexed as
single tokens, but if a
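
[Editor's sketch] The stop-word n-gram idea described above can be illustrated with a small standalone Java class. This is only a rough sketch of the general technique, not the code in Nutch's CommonGrams.java: the class name StopWordGrams, the "_" separator, and the decision to emit a joined pair whenever either neighbour is a stop word are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the Nutch implementation.
public class StopWordGrams {
    private final Set<String> stopWords;

    // Stop words are assumed to be supplied in lower case.
    public StopWordGrams(String... stopWords) {
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }

    // Emits ordinary words as single tokens, never emits a stop word
    // on its own, and emits a joined bigram for every adjacent pair
    // in which at least one of the two words is a stop word.
    public List<String> analyze(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String cur = tokens.get(i).toLowerCase();
            if (!stopWords.contains(cur)) {
                out.add(cur);
            }
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1).toLowerCase();
                if (stopWords.contains(cur) || stopWords.contains(next)) {
                    out.add(cur + "_" + next);
                }
            }
        }
        return out;
    }
}
```

With "a" configured as a stop word, the input "Hepatitis A vaccine" would be indexed without a standalone "a" term, while the joined term for "hepatitis" followed by "a" remains searchable as a unit.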
On 11/13/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
The SynonymFilter could have the following config:
hepatitis a, hepatitis_a
Oops, the synonyms should be reversed like so:
hepatitis_a, hepatitis a
so that when expand="false" for querying, hepatitis a is mapped to hepatitis_a
-Yonik
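
[Editor's sketch] The effect of the mapping above (the two-token phrase "hepatitis a" collapsing to the single token hepatitis_a) can be shown with a small standalone Java class. This is a simplified illustration, not Solr's SynonymFilter: the class name PhraseCollapser and its methods are hypothetical, and it handles only the collapsing direction used at query time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Solr's SynonymFilter.
public class PhraseCollapser {
    // Maps a lower-cased multi-word phrase to a single replacement token.
    private final Map<List<String>, String> phrases = new LinkedHashMap<>();

    public void add(String phrase, String replacement) {
        phrases.put(Arrays.asList(phrase.toLowerCase().split("\\s+")), replacement);
    }

    // Scans left to right; where a registered phrase starts at the
    // current position it emits the replacement and skips the phrase,
    // otherwise it emits the lower-cased token unchanged.
    public List<String> collapse(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean matched = false;
            for (Map.Entry<List<String>, String> e : phrases.entrySet()) {
                List<String> key = e.getKey();
                int end = i + key.size();
                if (end <= tokens.size() && lower(tokens.subList(i, end)).equals(key)) {
                    out.add(e.getValue());
                    i = end;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.add(tokens.get(i).toLowerCase());
                i++;
            }
        }
        return out;
    }

    private static List<String> lower(List<String> in) {
        List<String> result = new ArrayList<>();
        for (String s : in) {
            result.add(s.toLowerCase());
        }
        return result;
    }
}
```

Because the phrase becomes one token, a later stop-word step that drops single letters would no longer see the "a" in "hepatitis a" as a separate word.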
On 11/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:
- Somewhat related : Let's say I index "Polymyxin B". If I stopword
single letters, would a phrase search ("Polymyxin B") still find the
right documents (I don't think so, but still)? If not, I'll have to
index single letters; how do I prev
: > Sadly I can't rely on users smartness for this :) I have concerns that
: > for stuff like Hepatitis A, it will match just about every document
: > containing hepatitis and the very common 'a' word, anywhere in the
: > document. I can't stopword single letters, cause then there would be no
: >
On 11/13/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
Another approach is to implement protected phrases, similar to the
protected words in stemming. These would be protected from stopword
processing.
One could use the synonym filter (which can handle multi-token
synonyms) to get this effect
On 11/12/06 8:52 PM, "Michael Imbeault" <[EMAIL PROTECTED]>
wrote:
> Sadly I can't rely on users smartness for this :) I have concerns that
> for stuff like Hepatitis A, it will match just about every document
> containing hepatitis and the very common 'a' word, anywhere in the
> document. I can't
Chris Hostetter wrote:
A couple of things make your question really hard to answer ... first off,
you can specify different analyzer chains for index time and query time --
when dealing with the WordDelim filter (or the synonym filter) this is
frequently necessary -- so the answers to your questi
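: - Let's say I index "HIV-1" with <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which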
: after parsing by the above filter would yield HIV1 or HIV 1) also find
: documents which have HIV and the number "1" somewhere in the document,
: but not directly after HIV? If so, how should I fix this? I cou
Hello again,
- Let's say I index "HIV-1" with <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which
after parsing by the above filter would yield HIV1 or H