Re: Multi-words synonyms matching

Lance Norskog Tue, 29 May 2012 14:57:50 -0700

I recently had the same use case. I wound up doing this: at both index
and query time, the synonym filter is set to 'expand=false'. Every
multi-word synonym maps to one single-word synonym (one per group). This
way, only the main word is indexed or queried.


If the synonym file changes, you have to re-index the matching content.
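
To make the idea concrete, here is a minimal sketch in Python of what the
reduction amounts to. It runs outside Solr and is only an illustration: the
helpers build_map and normalize are hypothetical names, not Solr APIs, and
the sketch assumes plain comma-separated groups (no "=>" rules, no
backslash-escaped spaces). As I understand it, with expand=false a group
such as "mairie, hotel de ville" behaves as if every entry were mapped to
the first entry of the group.

```python
# Illustration only: mimics the effect of expand=false on a synonym group.
# build_map and normalize are hypothetical helpers, not Solr APIs.

def build_map(lines):
    """Map every entry of a comma-separated group to the group's first entry."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        entries = [e.strip().lower() for e in line.split(",") if e.strip()]
        canonical = entries[0]
        for entry in entries:
            # Key on the word sequence so multi-word entries can be matched.
            mapping[tuple(entry.split())] = canonical
    return mapping


def normalize(text, mapping):
    """Replace the longest matching word sequence with its canonical word."""
    tokens = text.lower().split()
    longest = max((len(key) for key in mapping), default=1)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in mapping:
                out.append(mapping[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no synonym starts at this position
            i += 1
    return out


synonyms = build_map(["mairie, hotel de ville"])
print(normalize("hotel de ville", synonyms))  # same token as for "mairie"
print(normalize("hotel", synonyms))           # left untouched
```

Applying the same normalization to the indexed text and to the query is
what lets "hotel de ville" find documents containing "mairie" while a bare
"hotel" does not. It also shows why a changed synonym file forces a
re-index: the canonical word is what actually gets stored.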

On Tue, May 29, 2012 at 1:27 PM, elisabeth benoit
<elisaelisael...@gmail.com> wrote:
> Hello Bernd,
>
> Thanks a lot for your answer. I'll work on this.
>
> Best regards,
> Elisabeth
>
> 2012/5/29 Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
>
>> Hello Elisabeth,
>>
>> my synonyms.txt is like your 2nd example:
>>
>> naturwald, φυσικό\ δάσος, естествена\ гора, prírodný\ les, naravni\ gozd,
>> foresta\ naturale, natuurbos, natural\ forest, bosque\ natural,
>> természetes\ erdő,
>> natūralus\ miškas, prirodna\ šuma, dabiskais\ mežs, floresta\ natural,
>> naturskov,
>> forêt\ naturelle, naturskog, přírodní\ les, luonnonmetsä, pădure\ naturală,
>> las\ naturalny, natürlicher\ wald
>>
>>
>> An example from my system with debugging turned on and searching for
>> "naturwald":
>>
>> <lst name="debug">
>>  <str name="rawquerystring">naturwald</str>
>>  <str name="querystring">naturwald</str>
>>  <str name="parsedquery">textth:naturwald textth:"φυσικό δάσος"
>> textth:"естествена гора"
>> textth:"prírodný les" textth:"naravni gozd" textth:"foresta naturale"
>> textth:natuurbos
>> textth:"natural forest" textth:"bosque natural" textth:"természetes erdő"
>> textth:"natūralus miškas" textth:"prirodna šuma" textth:"dabiskais mežs"
>> textth:"floresta natural" textth:naturskov textth:"forêt naturelle"
>> textth:naturskog
>> textth:"přírodní les" textth:luonnonmetsä textth:"pădure naturală"
>> textth:"las naturalny"
>> textth:"natürlicher wald"</str>
>> ...
>>
>> As you can see my search for "naturwald" extends to single and multiword
>> synonyms e.g. "forêt naturelle"
>>
>>
>> My SynonymFilterFactory has the following settings:
>>
>> org.apache.solr.analysis.SynonymFilterFactory
>> {tokenizerFactory=solr.KeywordTokenizerFactory,
>> synonyms=synonyms_eurovoc_desc_desc_ufall.txt, expand=true, format=solr,
>> ignoreCase=true,
>> luceneMatchVersion=LUCENE_36}
>>
>> But as I already mentioned, there is much more work to be done to get it
>> running than just using SynonymFilterFactory.
>>
>> Regards
>> Bernd
>>
>>
>>
>> Am 23.05.2012 08:49, schrieb elisabeth benoit:
>> > Hello Bernd,
>> >
>> > Thanks for your advice.
>> >
>> > I have one question: how did you manage to map one word to a multiwords
>> > synonym???
>> >
>> > I've tried (in synonyms.txt)
>> >
>> > mairie, hotel de ville
>> >
>> > mairie, hotel\ de\ ville
>> >
>> > mairie => mairie, hotel de ville
>> >
>> > mairie => mairie, hotel\ de\ ville
>> >
>> > but nothing prevents mairie from matching with "hotel"...
>> >
>> > The only way I found is to use
>> > tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms
>> > declaration in schema.xml, but then since "mairie" is not alone in my
>> > index field, it doesn't match.
>> >
>> >
>> > best regards,
>> > Elisabeth
>> >
>> >
>> >
>> >
>> > the only way I found, in schema.xml, is to use
>> >
>> >
>> >
>> > 2012/5/15 Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
>> >
>> >> Without reading the whole thread let me say that you should not trust
>> >> the Solr admin analysis page. It takes the whole multiword search and
>> >> runs it all at once through each analyzer step (factory).
>> >> But this is not how the real system works. First pitfall: the query
>> >> parser also splits on white space (unless it is a phrase query). Because
>> >> of this, a multiword query is sent chunk by chunk through the analyzer
>> >> and, second pitfall, each chunk runs through the whole analyzer on its
>> >> own.
>> >>
>> >> So if you are dealing with multiword synonyms you have the following
>> >> problem. Either you turn your query into a phrase so that the whole
>> >> phrase is analyzed at once and therefore looked up as a multiword
>> >> synonym, but phrase queries are not analyzed!!! Or you send your query
>> >> chunk by chunk through the analyzer, but then the chunks are no longer
>> >> multiwords and are not found in your synonyms.txt.
>> >>
>> >> From my experience I can say that it requires some deep work to get it
>> >> done, but it is possible. I have connected a thesaurus to Solr which does
>> >> query-time expansion (no need to reindex if the thesaurus changes).
>> >> The thesaurus holds synonyms and "used for terms" in 24 languages, so
>> >> it is also a kind of language translation. And naturally the thesaurus
>> >> translates from single-term to multi-term synonyms and vice versa.
>> >>
>> >> Regards,
>> >> Bernd
>> >>
>> >>
>> >> Am 14.05.2012 13:54, schrieb elisabeth benoit:
>> >>> Just for the record, I'd like to conclude this thread
>> >>>
>> >>> First, you were right, there was no behaviour difference between the fq
>> >>> and q parameters.
>> >>>
>> >>> I realized that:
>> >>>
>> >>> 1) my synonym (hotel de ville) has a stopword in it (de), and since I
>> >>> used tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms
>> >>> declaration, there was no stopword removal in the indexed expression.
>> >>> So when requesting "hotel de ville", after stopword removal in the
>> >>> query, Solr was comparing "hotel de ville" with "hotel ville"
>> >>>
>> >>> but my queries never even got to that point since
>> >>>
>> >>> 2) I made a mistake using "mairie" alone in the admin interface when
>> >>> testing my schema. The real field was something like "collectivités
>> >>> territoriales mairie", so the synonym "hotel de ville" was not even
>> >>> applied, because the tokenizerFactory="solr.KeywordTokenizerFactory"
>> >>> in my synonym definition does not split the field into words when
>> >>> parsing
>> >>>
>> >>> So my problem is not solved, and I'm considering solving it outside of
>> >>> Solr, unless someone else has a clue
>> >>>
>> >>> Thanks again,
>> >>> Elisabeth
>> >>>
>> >>>
>> >>>
>> >>> 2012/4/25 Erick Erickson <erickerick...@gmail.com>
>> >>>
>> >>>> A little farther down the debug info output you'll find something
>> >>>> like this (I specified fq=name:features)
>> >>>>
>> >>>> <arr name="parsed_filter_queries">
>> >>>> <str>name:features</str>
>> >>>> </arr>
>> >>>>
>> >>>>
>> >>>> so it may well give you some clue. But unless I'm reading things
>> >>>> wrong, your q is going against a field that has much more information
>> >>>> than the CATEGORY_ANALYZED field. Is it possible that the data from
>> >>>> your test cases simply isn't _in_ CATEGORY_ANALYZED?
>> >>>>
>> >>>> Best
>> >>>> Erick
>> >>>>
>> >>>> On Wed, Apr 25, 2012 at 9:39 AM, elisabeth benoit
>> >>>> <elisaelisael...@gmail.com> wrote:
>> >>>>> I'm not at the office until next Wednesday, and I don't have my Solr
>> >>>>> at hand, but doesn't debugQuery=on give information only about q
>> >>>>> parameter matching and nothing about the fq parameter? Or do you mean
>> >>>>> "parsed_filter_queries" gives information about fq?
>> >>>>>
>> >>>>> CATEGORY_ANALYZED is being populated by a copyField instruction in
>> >>>>> schema.xml, and has the same field type as my catchall field, the
>> >>>>> search field for my searchHandler (the one being used by q parameter).
>> >>>>>
>> >>>>> CATEGORY (a string) is copied into CATEGORY_ANALYZED (field type is
>> >>>>> text)
>> >>>>>
>> >>>>> CATEGORY (a string) is copied into the catchall field (field type is
>> >>>>> text), and a lot of other fields are copied into that catchall field
>> >>>>> too.
>> >>>>>
>> >>>>> So as far as I can see, the same analysis should be done in both
>> >>>>> cases, but obviously I'm missing something, and the only thing I can
>> >>>>> think of is a different behavior between the q and fq parameters.
>> >>>>>
>> >>>>> I'll check that parsed_filter_queries first thing in the morning next
>> >>>>> Wednesday.
>> >>>>>
>> >>>>> Thanks a lot for your help.
>> >>>>>
>> >>>>> Elisabeth
>> >>>>>
>> >>>>>
>> >>>>> 2012/4/24 Erick Erickson <erickerick...@gmail.com>
>> >>>>>
>> >>>>>> Elisabeth:
>> >>>>>>
>> >>>>>> What shows up in the debug section of the response when you add
>> >>>>>> &debugQuery=on? There should be some bit of that section like:
>> >>>>>> "parsed_filter_queries"
>> >>>>>>
>> >>>>>> My other question is "are you absolutely sure that your
>> >>>>>> CATEGORY_ANALYZED field has the correct content?". How does it
>> >>>>>> get populated?
>> >>>>>>
>> >>>>>> Nothing jumps out at me here....
>> >>>>>>
>> >>>>>> Best
>> >>>>>> Erick
>> >>>>>>
>> >>>>>> On Tue, Apr 24, 2012 at 9:55 AM, elisabeth benoit
>> >>>>>> <elisaelisael...@gmail.com> wrote:
>> >>>>>>> yes, thanks, but this is NOT my question.
>> >>>>>>>
>> >>>>>>> I was wondering why I have multiple matches with q="hotel de ville"
>> >>>>>>> and no match with fq=CATEGORY_ANALYZED:"hotel de ville", since in
>> >>>>>>> both cases I'm searching in the same Solr fieldType.
>> >>>>>>>
>> >>>>>>> Why is the q parameter behaving differently in that case? Why do the
>> >>>>>>> quotes work in one case and not in the other?
>> >>>>>>>
>> >>>>>>> Does anyone know?
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Elisabeth
>> >>>>>>>
>> >>>>>>> 2012/4/24 Jeevanandam <je...@myjeeva.com>
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>> usage of q and fq
>> >>>>>>>>
>> >>>>>>>> q => is typically the main query for the search request
>> >>>>>>>>
>> >>>>>>>> fq => is Filter Query; generally used to restrict the super set of
>> >>>>>>>> documents without influencing score (more info:
>> >>>>>>>> http://wiki.apache.org/solr/CommonQueryParameters#q)
>> >>>>>>>>
>> >>>>>>>> For example:
>> >>>>>>>> ------------
>> >>>>>>>> q="hotel de ville" ===> returns 100 documents
>> >>>>>>>>
>> >>>>>>>> q="hotel de ville"&fq=price:[100 TO *]&fq=roomType:"King size Bed"
>> >>>>>>>> ===> returns 40 documents from super set of 100 documents
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> hope this helps!
>> >>>>>>>>
>> >>>>>>>> - Jeevanandam
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On 24-04-2012 3:08 pm, elisabeth benoit wrote:
>> >>>>>>>>
>> >>>>>>>>> Hello,
>> >>>>>>>>>
>> >>>>>>>>> I'd like to resume this post.
>> >>>>>>>>>
>> >>>>>>>>> The only way I found to keep synonyms in synonyms.txt from being
>> >>>>>>>>> split into words is to use the line
>> >>>>>>>>>
>> >>>>>>>>>  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> >>>>>>>>>  ignoreCase="true" expand="true"
>> >>>>>>>>>  tokenizerFactory="solr.KeywordTokenizerFactory"/>
>> >>>>>>>>>
>> >>>>>>>>> in schema.xml
>> >>>>>>>>>
>> >>>>>>>>> where tokenizerFactory="solr.KeywordTokenizerFactory"
>> >>>>>>>>>
>> >>>>>>>>> instructs SynonymFilterFactory not to break synonyms into words on
>> >>>>>>>>> white space when parsing the synonyms file.
>> >>>>>>>>>
>> >>>>>>>>> So now it works fine, "mairie" is mapped to "hotel de ville", and
>> >>>>>>>>> when I send the request q="hotel de ville" (quotes are mandatory to
>> >>>>>>>>> prevent the analyzer from splitting hotel de ville on white space),
>> >>>>>>>>> I get answers with the word "mairie".
>> >>>>>>>>>
>> >>>>>>>>> But when I use the fq parameter (fq=CATEGORY_ANALYZED:"hotel de
>> >>>>>>>>> ville"), it doesn't work!!!
>> >>>>>>>>>
>> >>>>>>>>> CATEGORY_ANALYZED has the same field type as the default search
>> >>>>>>>>> field. This means that when I send q="hotel de ville" and
>> >>>>>>>>> fq=CATEGORY_ANALYZED:"hotel de ville", Solr uses the same analyzer,
>> >>>>>>>>> the one with the line
>> >>>>>>>>>
>> >>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> >>>>>>>>> ignoreCase="true" expand="true"
>> >>>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/>.
>> >>>>>>>>>
>> >>>>>>>>> Does anyone have a clue what is different between q analysis
>> >>>>>>>>> behaviour and fq analysis behaviour?
>> >>>>>>>>>
>> >>>>>>>>> Thanks a lot
>> >>>>>>>>> Elisabeth
>> >>>>>>>>>
>> >>>>>>>>> 2012/4/12 elisabeth benoit <elisaelisael...@gmail.com>
>> >>>>>>>>>
>> >>>>>>>>>> oh, that's right.
>> >>>>>>>>>>
>> >>>>>>>>>> thanks a lot,
>> >>>>>>>>>> Elisabeth
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> 2012/4/11 Jeevanandam Madanagopal <je...@myjeeva.com>
>> >>>>>>>>>>
>> >>>>>>>>>>> Elisabeth -
>> >>>>>>>>>>>
>> >>>>>>>>>>> As you described, the mapping below might suit your need.
>> >>>>>>>>>>> mairie => hotel de ville, mairie
>> >>>>>>>>>>>
>> >>>>>>>>>>> mairie gets expanded to "hotel de ville" and "mairie" at index
>> >>>>>>>>>>> time, so both "mairie" and "hotel de ville" are searchable on the
>> >>>>>>>>>>> document.
>> >>>>>>>>>>>
>> >>>>>>>>>>> However, the white space tokenizer splitting at query time will
>> >>>>>>>>>>> still be a problem, as described by Markus.
>> >>>>>>>>>>>
>> >>>>>>>>>>> --Jeevanandam
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> <<Have you tried the "=>" mapping instead? Something
>> >>>>>>>>>>>> <<like
>> >>>>>>>>>>>> <<hotel de ville => mairie
>> >>>>>>>>>>>> <<might work for you.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Yes, thanks, I've tried it, but from what I understand it doesn't
>> >>>>>>>>>>>> solve my problem, since this means hotel de ville will be replaced
>> >>>>>>>>>>>> by mairie at index time (I use synonyms only at index time). So
>> >>>>>>>>>>>> when a user asks for "hôtel de ville", it won't match.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> In fact, at index time I have mairie in my data, but I want the
>> >>>>>>>>>>>> user to be able to request "mairie" or "hôtel de ville" and get
>> >>>>>>>>>>>> mairie as an answer, and not get mairie as an answer when
>> >>>>>>>>>>>> requesting "hôtel".
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> <<To map `mairie` to `hotel de ville` as single token you must
>> >>>>>>>>>>>> <<escape your white space.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> <<mairie, hotel\ de\ ville
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> <<This results in a problem if your tokenizer splits on white
>> >>>>>>>>>>>> <<space at query time.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Ok, I guess this means I have a problem. No simple solution,
>> >>>>>>>>>>>> since at query time my tokenizer does split on white space.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I guess my problem is more or less one of the problems discussed
>> >>>>>>>>>>>> in
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks a lot for your answers,
>> >>>>>>>>>>>> Elisabeth
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> 2012/4/10 Erick Erickson <erickerick...@gmail.com>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Have you tried the "=>" mapping instead? Something
>> >>>>>>>>>>>>> like
>> >>>>>>>>>>>>> hotel de ville => mairie
>> >>>>>>>>>>>>> might work for you.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Best
>> >>>>>>>>>>>>> Erick
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
>> >>>>>>>>>>>>> <elisaelisael...@gmail.com> wrote:
>> >>>>>>>>>>>>>> Hello,
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I've read several posts on this issue, but can't find a real
>> >>>>>>>>>>>>>> solution to my multi-words synonyms matching problem.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I have in my synonyms.txt an entry like
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> mairie, hotel de ville
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> and my index time analyzer is configured as follows for
>> >>>>>>>>>>>>>> synonyms.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> >>>>>>>>>>>>>> ignoreCase="true" expand="true"/>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> The problem I have is that now "mairie" matches with "hotel", and
>> >>>>>>>>>>>>>> I would only want "mairie" to match with "hotel de ville" and
>> >>>>>>>>>>>>>> "mairie".
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> When I look into the analyzer, I see that "mairie" is mapped to
>> >>>>>>>>>>>>>> "hotel", and the words "de ville" are added in second and third
>> >>>>>>>>>>>>>> position. To change that, I tried to do
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> >>>>>>>>>>>>>> ignoreCase="true" expand="true"
>> >>>>>>>>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/> (as I read in
>> >>>>>>>>>>>>>> one post)
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> and I can see now in the analyzer that "mairie" is mapped to
>> >>>>>>>>>>>>>> "hotel de ville", but now when I have the query "hotel de ville",
>> >>>>>>>>>>>>>> it doesn't match at all with "mairie".
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Does anyone have a clue what I'm doing wrong?
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I'm using Solr 3.4.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>> Elisabeth
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> >> --
>> >> *************************************************************
>> >> Bernd Fehling                Universitätsbibliothek Bielefeld
>> >> Dipl.-Inform. (FH)                        Universitätsstr. 25
>> >> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
>> >> bernd.fehl...@uni-bielefeld.de                33615 Bielefeld
>> >>
>> >> BASE - Bielefeld Academic Search Engine - www.base-search.net
>> >> *************************************************************
>> >>
>> >
>>
>> --
>> *************************************************************
>> Bernd Fehling                Universitätsbibliothek Bielefeld
>> Dipl.-Inform. (FH)                        Universitätsstr. 25
>> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
>> bernd.fehl...@uni-bielefeld.de                33615 Bielefeld
>>
>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>> *************************************************************
>>



-- 
Lance Norskog
goks...@gmail.com
