Hello,

My name is Frédéric Baroz. I work as an in-hospital physician in internal medicine in Switzerland (I speak French) and as a software engineer. I work in medical informatics and I am currently doing research on "semantic search" for in-hospital physicians, who are confronted daily with the need to search for medical information.

I am quite a newbie with Lucene/Solr and have spent most of this last year getting acquainted with this brilliant technology.

In the context of my work, I noticed that analysis, whether index-time or query-time, sometimes needs to expand the text by injecting more or less processed tokens one after the other. One common scenario is to have the system "prefer" exact word matches by injecting a stemmed version of each token into the index alongside the unmodified version. Other token filters behave similarly, such as KeywordRepeatFilter, which emits two versions of each processed token, one of which is flagged so that it skips the stemming phase. A last example is AutoPhrasingTokenFilter, a contribution from Lucidworks that offers a "workaround" for multi-term synonym matching (see http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/).

One problem with this approach, as I understand it, is that filters that behave this way break the analysis capabilities of subsequent filters. For example, if we use KeywordRepeatFilter and then AutoPhrasingTokenFilter, the latter will have no effect, because it *never sees* the token sequence it is waiting for: KeywordRepeatFilter has added one extra token after each word.

In my opinion, tokens "to be injected" should be injected all at once, after the original token stream has been emitted, rather than after each token seen by the filter. This would avoid breaking the ordered sequence of tokens, which, in my opinion, carries important information.

So my question is: has anyone already addressed this problem, and are there any workarounds that one might have thought of?
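
To make the ordering problem described above concrete, here is a small toy model in Python. It is only a sketch and not Lucene code: the function names (`repeat_interleaved`, `repeat_deferred`, `auto_phrase`) are hypothetical, tokens are plain `(term, is_keyword)` pairs, and position increments are ignored entirely, so it only models the raw order in which a downstream filter would receive tokens. The phrase matcher is a naive stand-in that assumes a phrase must arrive as consecutive tokens.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, bool]  # (term, is_keyword) -- toy representation, not a Lucene type


def repeat_interleaved(terms: Sequence[str]) -> List[Token]:
    """Emit a keyword-flagged copy and a plain copy right after each other."""
    out: List[Token] = []
    for term in terms:
        out.append((term, True))   # flagged copy, meant to skip stemming
        out.append((term, False))  # plain copy
    return out


def repeat_deferred(terms: Sequence[str]) -> List[Token]:
    """Emit the whole original sequence first, then all flagged copies."""
    originals = [(term, False) for term in terms]
    copies = [(term, True) for term in terms]
    return originals + copies


def auto_phrase(tokens: Sequence[Token], phrase: Sequence[str]) -> List[Token]:
    """Naive phraser: merges a phrase only when its terms arrive consecutively."""
    n = len(phrase)
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        window = tuple(term for term, _ in tokens[i:i + n])
        if window == tuple(phrase):
            # The merged token inherits the flag of the first token in the window.
            out.append(("_".join(phrase), tokens[i][1]))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


terms = ["acute", "myocardial", "infarction", "treatment"]
phrase = ("acute", "myocardial", "infarction")

interleaved = auto_phrase(repeat_interleaved(terms), phrase)
deferred = auto_phrase(repeat_deferred(terms), phrase)

print(interleaved)  # the phrase is never merged in this stream
print(deferred)     # the phrase is merged in both the original run and the copies
```

In this toy model the interleaved stream never contains the three phrase words in a row, so the phraser leaves every token untouched, while the deferred stream keeps each run intact. One caveat of the sketch: because it matches on terms only, a two-word phrase could still match by accident across the boundary between a duplicate and the next word, so the example uses a phrase of three distinct words.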

And for the record, today Google is no friend to me ;)

Thanks in advance for your help,

Frédéric Baroz