Here's the best thread I've found so far about multi-word matching and synonyms: http://www.nabble.com/solr-synonyms-behaviour-ts15051211.html#a18476205
And an interesting workaround:
http://www.nabble.com/solr-synonyms-behaviour-ts15051211.html#a18693735

Earlier on, the thread repeats the claim that, if you use index-side
expansion, you won't have a problem. But it doesn't explain how/why that
fixes it, given that the Lucene parser still breaks on white space.

Later there's a clue: it seems that even single words of a multi-word
thesaurus entry are matched. So I guess Lucene doesn't need to see both
words of a multi-word query; it just picks up either word. That works
around the multi-word parsing problem, but adds the undesirable side
effect of false positive matches?

So the repeated claim that index-side expansion fixes multi-word matching
should always carry the caveat "... and it can cause false positive matches
when only one of the words is present"?

Am I understanding this correctly? If true, it's likely to be acceptable in
many applications; it's just a question of understanding the trade-offs.

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

On Mon, Aug 24, 2009 at 10:47 AM, Mark Bennett <mbenn...@ideaeng.com> wrote:

> There are a couple of things about the Solr Thesaurus doc that I'd like to
> confirm / understand.
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter
>
> There's a section about multi-word matching, using seabiscuit as an
> example. I've also seen references to this discussion in posts talking
> about dismax and the synonym filter (quoted below). Where I think it
> could use some additional clarification is in this sentence:
> "The recommended approach ... is to expand the synonym when indexing."
>
> The section below describes why not doing it this way won't work, but it
> doesn't explain how using index-time expansion fixes it. In particular,
> even if I do index-time expansion, isn't a multi-word input synonym still
> going to be messed with by the Lucene parser?
> From the Wiki: "The Lucene QueryParser tokenizes on white space before
> giving any text to the Analyzer...". Understood, but how does index-time
> expansion address that, either directly or indirectly?
>
> > Keep in mind that while the SynonymFilter will happily work with
> > synonyms containing multiple words (ie: "sea biscuit, sea biscit,
> > seabiscuit"), the recommended approach for dealing with synonyms like
> > this is to expand the synonym when indexing. This is because there are
> > two potential issues that can arise at query time:
> >
> > 1: The Lucene QueryParser tokenizes on white space before giving any
> > text to the Analyzer, so if a person searches for the words sea biscit
> > the analyzer will be given the words "sea" and "biscit" separately, and
> > will not know that they match a synonym.
> >
> > 2: Phrase searching (ie: "sea biscit") will cause the QueryParser to
> > pass the entire string to the analyzer, but if the SynonymFilter is
> > configured to expand the synonyms, then when the QueryParser gets the
> > resulting list of tokens back from the Analyzer, it will construct a
> > MultiPhraseQuery that will not have the desired effect. This is because
> > of the limited mechanism available for the Analyzer to indicate that
> > two terms occupy the same position: there is no way to indicate that a
> > "phrase" occupies the same position as a term. For our example the
> > resulting MultiPhraseQuery would be
> > "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match
> > the simple case of "seabiscuit" occurring in a document.
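
To make the question concrete, here is a toy Python model of the behaviour I'm guessing at. It is not Lucene or Solr code: the function names (expand_at_index_time, parse_query, matches) are made up for illustration, the synonym group is the one from the wiki example, and the "every query word must be present" matching rule is an assumption. It only shows why splitting the query on white space would still find a document once every word of every synonym variant has been added to the index, and why a lone "sea" would then match too.

```python
# Toy model only: NOT Lucene/Solr code. All names here are hypothetical.
SYNONYM_GROUP = ["sea biscuit", "sea biscit", "seabiscuit"]  # wiki example entry


def expand_at_index_time(text):
    """Return the set of terms indexed for `text` with synonym expansion.

    At index time the filter sees the whole token stream, so it can
    recognise a multi-word entry. Every word of every variant in the
    group is then added to the indexed terms (assumed behaviour).
    """
    words = text.lower().split()
    terms = set(words)
    for variant in SYNONYM_GROUP:
        v = variant.split()
        n = len(v)
        # Does this variant occur as a run of consecutive words?
        if any(words[i:i + n] == v for i in range(len(words) - n + 1)):
            for other in SYNONYM_GROUP:
                terms.update(other.split())
    return terms


def parse_query(query):
    """Mimic the white-space split the wiki describes: one term per word."""
    return query.lower().split()


def matches(query, indexed_terms):
    """A document matches if every query word is among its indexed terms."""
    return all(term in indexed_terms for term in parse_query(query))


doc = "Seabiscuit won the race"
plain = set(doc.lower().split())       # no synonym expansion
expanded = expand_at_index_time(doc)   # index-time expansion

print(matches("sea biscit", plain))     # False: neither word is in the document
print(matches("sea biscit", expanded))  # True: both words were added at index time
print(matches("sea", expanded))         # True: the suspected false positive
```

If this model is roughly right, the last line is the trade-off in question: a search for "sea" alone hits a document that only ever mentioned "seabiscuit".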