Have you thought about using copyField with two different processing pipelines? Then you could search both variants with different weights.
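At the Lucene level, that idea looks roughly like the sketch below: the same text is indexed into an "exact" field and a "stemmed" field, then queried across both with different boosts. This is a minimal, illustrative sketch only; the field names, analyzers, boost values and sample text are my own assumptions, and it targets Lucene 5.3+ style APIs. In Solr itself it is just a second field type, a copyField directive, and per-field boosts in (e)dismax's qf.

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ExactPlusStemmedSketch {
  public static void main(String[] args) throws Exception {
    // "copyField"-style duplication: two fields, two analysis pipelines.
    Map<String, Analyzer> perField = new HashMap<>();
    perField.put("body_exact", new StandardAnalyzer());   // no stemming
    perField.put("body_stemmed", new EnglishAnalyzer());  // Porter-stemmed
    Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

    Directory dir = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      String text = "treatment of renal infarction";
      // The same text goes into both fields, each analyzed by its own chain.
      doc.add(new TextField("body_exact", text, Store.NO));
      doc.add(new TextField("body_stemmed", text, Store.NO));
      writer.addDocument(doc);
    }

    // Search both variants, weighting the exact field higher so literal matches win.
    // (Terms are written pre-analyzed here; a real query parser would analyze per field.)
    BooleanQuery query = new BooleanQuery.Builder()
        .add(new BoostQuery(new TermQuery(new Term("body_exact", "infarction")), 4.0f), Occur.SHOULD)
        .add(new BoostQuery(new TermQuery(new Term("body_stemmed", "infarct")), 1.0f), Occur.SHOULD)
        .build();

    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      TopDocs hits = new IndexSearcher(reader).search(query, 10);
      System.out.println("matches: " + hits.totalHits);
    }
  }
}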
Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 4 March 2015 at 14:18, fredericbaroz <fredericba...@gmail.com> wrote:
> Hello,
>
> My name is Frédéric Baroz. I work as an in-hospital physician in Internal
> Medicine in Switzerland (I speak French) and as a software engineer. I work
> in medical informatics and I am currently doing some research on "semantic
> search" for in-hospital physicians who are daily confronted with searching
> for medical information.
>
> I am quite a newbie in Lucene/Solr and I have spent most of my time this
> last year getting acquainted with this brilliant technology. In the context
> of my work, I noticed that analysis, at index time or query time, sometimes
> needs to expand the text by injecting more or less processed tokens one
> after the other.
>
> One common scenario is to have the system "prefer" exact word matches by
> injecting into the index a stemmed version along with the unmolested version
> of each token. Other token filters have similar behaviour, like
> KeywordRepeatFilter, which injects two versions of each processed token, one
> of which is flagged so that it skips the stemming phase. A last example is
> AutoPhrasingTokenFilter, a contribution from Lucidworks which offers a
> "workaround" for multi-term synonym matching (see
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/).
>
> One problem with this approach, as I understand it, is that filters that
> adopt this behaviour break the analysis capabilities of subsequent filters.
> For example, if we use KeywordRepeatFilter and then AutoPhrasingTokenFilter,
> the latter will have no effect, since it *never sees* the token sequence it
> was waiting for: an extra word has been added after each word because of
> KeywordRepeatFilter.
>
> In my opinion, tokens "to be injected" should be injected all at once, after
> the original token stream has been emitted, and not after each token seen by
> the filter. That way they would not break the ordered sequence of tokens,
> which in my opinion carries important information.
>
> So my question is: has anyone already addressed this problem, and are there
> any workarounds that one might have thought of?
>
> And for the record, today Google is no friend to me ;)
>
> Thanks in advance for your help,
>
> Frédéric Baroz
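As an aside, the keyword-repeat pattern described above can be reproduced with a few lines of Lucene code. This is a minimal sketch only, assuming Lucene 5.x+ APIs; the tokenizer choice, field name and sample text are illustrative, not taken from the thread.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class KeywordRepeatDemo {
  public static void main(String[] args) throws IOException {
    // KeywordRepeatFilter emits each token twice: the first copy is flagged as a
    // keyword so the stemmer leaves it alone, the second copy gets stemmed.
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream sink = new KeywordRepeatFilter(source);
        sink = new PorterStemFilter(sink);
        sink = new RemoveDuplicatesTokenFilter(sink); // drop copies the stemmer left unchanged
        return new TokenStreamComponents(source, sink);
      }
    };

    try (TokenStream ts = analyzer.tokenStream("body", "renal infarction")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term + " (+" + posInc.getPositionIncrement() + ")");
      }
      ts.end();
    }
  }
}

Running it prints the unstemmed copy followed by the stemmed copy at the same position (position increment 0, e.g. "infarction (+1)" then "infarct (+0)"). That interleaving of extra tokens into the stream is exactly what a downstream phrase-aware filter such as AutoPhrasingTokenFilter never expects, which is the breakage described in the question.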