Re: Multiplexing TokenFilter for multi-language?

Erick Erickson Tue, 09 Aug 2011 07:50:16 -0700

Frankly, I don't know what gremlins lurk in this approach. You might hop
over to the dev list and ask the question there; the language gurus
will almost certainly weigh in. I'd ask the question without opening a JIRA
first, just to see what the response is...

Offhand, I can imagine there would be some "interesting" effects, not
the least of which is the cost of instantiating a new one of these for every
field. Not to mention the problem you pointed out of removing the
language identifier from the stored data or, more generally, how to
associate the language with the particular instance of a field in a particular
document. And how do you characterize languages that are "similar enough"
to be treated this way?
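
To make the concern concrete, here is a rough, library-free sketch of the
multiplexing idea in Python. It is not Lucene or Solr code: the "lang:" marker
token, the toy stemming rules, and every name in it are assumptions made up for
illustration. It only shows the two moving parts discussed above, namely picking
a stemmer per token stream and stripping the language identifier so it is not
indexed.

```python
from typing import Callable, Dict, Iterable, Iterator

Stemmer = Callable[[str], str]

# Hypothetical marker convention: the first token of a stream names the language.
LANG_PREFIX = "lang:"


def toy_english_stem(token: str) -> str:
    # Toy rule, not a real stemmer: drop a trailing "s" from longer tokens.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def toy_german_stem(token: str) -> str:
    # Toy rule, not a real stemmer: drop a trailing "en" from longer tokens.
    return token[:-2] if token.endswith("en") and len(token) > 4 else token


STEMMERS: Dict[str, Stemmer] = {"en": toy_english_stem, "de": toy_german_stem}


def multiplex_stem(
    tokens: Iterable[str], stemmers: Dict[str, Stemmer] = STEMMERS
) -> Iterator[str]:
    """Stem a token stream with the stemmer named by its leading marker."""
    stem: Stemmer = lambda token: token  # identity until a known marker is seen
    for position, token in enumerate(tokens):
        if position == 0 and token.startswith(LANG_PREFIX):
            # Select the stemmer, then drop the marker so it is never emitted.
            stem = stemmers.get(token[len(LANG_PREFIX):], stem)
            continue
        yield stem(token)


if __name__ == "__main__":
    print(list(multiplex_stem(["lang:en", "cats", "chase", "dogs"])))
    print(list(multiplex_stem(["lang:de", "katzen", "laufen"])))
```

Note that the choice of stemmer is state that belongs to one token stream, so
any real filter built this way would have to reset it for each field value.
A stream with no marker, or an unknown language, falls through unstemmed here;
whether that is the right default is one of the open questions.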

I'm sure all those issues could be dealt with, but I wouldn't want to start
down that path before thinking through the implications for correctness. It's
not clear to me whether this approach would give good results. For instance,
how would you deal with synonyms in this case? Stopwords? And would
the relevance calculations be skewed? I'm not qualified to answer those
questions, but the folks on the dev list could at least frame the discussion.

Best
Erick

On Tue, Aug 9, 2011 at 10:16 AM, cnyee <yeec...@gmail.com> wrote:
> You are right - the stemmer was only instantiated twice. Not sure why it was
> instantiated twice. I tested with 10 and 50 records (maybe it was associated
> with the auto-commit cycle).
>
> What a bummer. Back to the drawing board again.
>
> Thanks for your input anyway. I was struggling with weird search behavior
> all day today. Now it all makes sense.
>
> I think a multiplexing stemmer would be a worthy extension for a future
> version of Solr.
>
> Best regards,
> Yee
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Multiplexing-TokenFilter-for-multi-language-tp3235341p3239103.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
