Thanks for the advice, Grant. I tried putting '_' into the synonyms, but step by step I realised that it was becoming ever more intrusive into the Solr source code... But I've found another solution, which I want to present here in order to get outside advice and perhaps have someone point out bugs or side effects I have not seen. I do not touch the source code; I only change my synonym.txt and the way I manage indexes in schema.xml.

Given a synonyms list like:

capital punishement, death sentence, death penalty
10, dix, X
17, Dix sept, XVII
18, dix huit, XVIII
Rock, jazz, modern music => modern music
Coluche, colucci => colucci
Coluche, coluci => coluci
Coluche, colucchi => colucchi
coluche, michel colucci => michel colucci

I was faced with two major problems with index-time synonym expansion (expand=true):

- Possibility of synonyms mixing ("10, dix, X" with "17, Dix sept, XVII" or "18, dix huit, XVIII")
- Possibility of a query matching unexpected results due to language ambiguity and, more generally, because expansion puts new tokens into the document that will be matched at query time (e.g. the query "capital" will match a document containing "death sentence")

So here is what I've done:

A single line in the synonym file can be seen as a family of synonyms, or of interchangeable terms and expressions. So instead of injecting into the document at index time, for a single match, all the possibilities found in the synonyms list, I changed the list to give an ID to each synonym family. The index-time synonym filter is no longer configured with expand=true but with expand=false, so that a matched term is replaced with the ID of its family.
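
To make the index-time replacement concrete, here is a minimal Python sketch of the idea. It is not Solr code: the whitespace tokenizer, the function names and the small FAMILIES excerpt are assumptions made for illustration, and family IDs such as SynFamily1 are only placeholder names.

```python
# Hypothetical sketch of index-time replacement with expand=false.
# FAMILIES is a small illustrative excerpt; spellings follow the list above.
FAMILIES = {
    "SynFamily1": ["capital punishement", "death sentence", "death penalty"],
    "SynFamily2": ["10", "dix", "x"],
    "SynFamily112": ["18", "xviii", "dix huit"],
}


def build_phrase_table(families):
    """Map each synonym, as a tuple of lowercase tokens, to its family ID."""
    table = {}
    for family_id, synonyms in families.items():
        for synonym in synonyms:
            table[tuple(synonym.lower().split())] = family_id
    return table


def replace_with_family_ids(text, table):
    """Replace every matched synonym with its family ID, longest match first."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    longest = max((len(key) for key in table), default=1)
    out = []
    i = 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            family_id = table.get(tuple(tokens[i:i + size]))
            if family_id is not None:
                out.append(family_id)
                i += size
                break
        else:
            # No synonym starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


TABLE = build_phrase_table(FAMILIES)
indexed = replace_with_family_ids(
    "The prisoner escaped just before the death sentence had been set", TABLE
)
print(indexed)
```

In this sketch the indexed tokens contain the single token SynFamily1 in place of "death sentence", and none of the words "death", "sentence" or "capital" is added to the document.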

Then at query time, I reintroduced the synonyms filter with expand=false in order to replace the matched synonyms in the query with their corresponding ID.

Here is my synonyms list used with expand=false:

SynFamily1, capital punishement, death sentence, death penalty
SynFamily2, 10, dix, x
SynFamily89, 17, xvii, dix sept
SynFamily112, 18, xviii, dix huit
rock, modern music => HierFamily2017
jazz, modern music => HierFamily2014
coluche, collucci => HierFamily1537
coluche, colluche => HierFamily1538
coluche, colucchi => HierFamily1541
coluche, colucci => HierFamily1542
coluche, coluchi => HierFamily1543
coluche, coluci => HierFamily1544

It seems to work fine: a query "capital" no longer matches a document that originally contained "death sentence", since the synonym expansion is limited to the one-token ID "SynFamily1", and in order to match such a document, a query like "capital punishement" must be made. The synonyms mixing also seems to have disappeared (a document containing "dix huit" will not match a query "10").

My question is: have I missed something? The solution seems too simple, and ever since I started working on full-text search engines I have always faced side-effect problems after logic modifications, so I'm a little sceptical... :)

Voilà! Thanks for your time,
Laurent

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 11 September 2007 14:53
To: solr-user@lucene.apache.org
Subject: Re: Synonyms expressions sens

Inline...

On Sep 11, 2007, at 7:27 AM, Laurent Gilles wrote:
> Hi,
>
> I'm actually facing a relevancy issue with multiword synonyms.
>
> Let's illustrate it with a test case:
>
> Given the following synonyms definition:
> --------------------------------------------------------------------
> capital punishement, death sentence, death penalty
> --------------------------------------------------------------------
>
> And a [EMAIL PROTECTED] defined at index time, so the document:
> --------------------------------------------------------------------
> The prisoner escaped just before the death sentence had been set.
> --------------------------------------------------------------------
>
> Will be indexed like
> --------------------------------------------------------------------
> The prisoner escaped just before the (death sentence | death penalty |
> capital punishment) had been set.
> --------------------------------------------------------------------
>
> Now, if a user asks for "capital", the system will match "capital"
> (which could mean 'Paris, capital of France') in the index-time
> synonym-expanded document, which doesn't make sense.
>
> I was expecting that, in order to match a document that contains
> "death sentence", I would have to give the entire expression
> "capital punishment", not only a part of it.
>
> This seems to be the normal Solr behaviour, but what I'm actually
> facing is a relevance problem with the results, since a word contained
> in an expression can have a completely different meaning from the same
> word in isolation.
>
> Is there a trick or a way to match complete synonym expressions and
> not the words the expansion has added to documents?
>

Ah, the ambiguity of language :-)

I can think of a couple of different suggestions to try:

1. Index your phrase synonyms as a single token, such as capital_punishment, death_penalty, etc.
   This requires that you be able to recognize phrases during indexing and querying, since you will want to transform capital punishment in your documents to capital_punishment. Alternatively, you could create a query like ("capital punishment" OR capital_punishment)

2. On the query side, you could produce queries like: capital AND -"capital punishment"

I don't know your system, but I suppose there is always the chance that a user searching for capital really does want all occurrences of capital (assuming no other context), which may cause problems.

HTH,
Grant
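
Suggestion 1 in the reply above can be sketched roughly as follows. This is a minimal Python illustration, not Solr code: the PHRASES list and both function names are hypothetical, and a real setup would need the same phrase recognition applied at indexing and at querying.

```python
import re

# Hypothetical phrase list; in practice it would come from the synonyms file.
PHRASES = ["capital punishment", "death penalty", "death sentence"]


def join_phrases(text, phrases):
    """Rewrite each known phrase as a single underscore-joined token."""
    result = text.lower()
    # Longest phrases first, so a longer phrase is not broken by a shorter one.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + r"\s+".join(map(re.escape, phrase.split())) + r"\b"
        result = re.sub(pattern, phrase.replace(" ", "_"), result)
    return result


def phrase_or_token_query(phrase):
    """Build the alternative query form: the quoted phrase OR the joined token."""
    return '("{0}" OR {1})'.format(phrase, phrase.replace(" ", "_"))


print(join_phrases("Capital punishment was abolished", PHRASES))
print(join_phrases("Paris, capital of France", PHRASES))
print(phrase_or_token_query("capital punishment"))
```

With this sketch, "capital" on its own stays a separate token, so it no longer collides with the joined token capital_punishment.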