Re: Multi-words synonyms matching

Bernd Fehling Tue, 29 May 2012 05:57:13 -0700

Hello Elisabeth,

my synonyms.txt is like your 2nd example:


naturwald, φυσικό\ δάσος, естествена\ гора, prírodný\ les, naravni\ gozd,
foresta\ naturale, natuurbos, natural\ forest, bosque\ natural,
természetes\ erdő, natūralus\ miškas, prirodna\ šuma, dabiskais\ mežs,
floresta\ natural, naturskov, forêt\ naturelle, naturskog, přírodní\ les,
luonnonmetsä, pădure\ naturală, las\ naturalny, natürlicher\ wald
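
A minimal sketch, in Python, of how one such line can be read: each
comma-separated entry is kept whole and the backslash in front of a blank is
dropped. This is an illustration only, not Solr's own parser, and the function
name parse_synonym_line is hypothetical.

```python
import re


def parse_synonym_line(line: str) -> list[str]:
    """Split one comma-separated synonym line into whole entries.

    Simplified illustration (an assumption, not Solr's implementation):
    split on commas that are not escaped, then remove the escaping
    backslashes so that "hotel\\ de\\ ville" becomes "hotel de ville".
    """
    parts = re.split(r"(?<!\\),", line)
    entries = []
    for part in parts:
        # Drop the backslash of every escape sequence, keep the escaped char.
        entry = re.sub(r"\\(.)", r"\1", part.strip())
        if entry:
            entries.append(entry)
    return entries


line = r"naturwald, foresta\ naturale, natuurbos, natural\ forest, forêt\ naturelle"
entries = parse_synonym_line(line)
print(entries)
```

Multiword entries such as "natural forest" come out as one string each, which
is the property the rest of the setup relies on.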


An example from my system with debugging turned on and searching for 
"naturwald":

<lst name="debug">
  <str name="rawquerystring">naturwald</str>
  <str name="querystring">naturwald</str>
  <str name="parsedquery">textth:naturwald textth:"φυσικό δάσος"
textth:"естествена гора" textth:"prírodný les" textth:"naravni gozd"
textth:"foresta naturale" textth:natuurbos textth:"natural forest"
textth:"bosque natural" textth:"természetes erdő" textth:"natūralus miškas"
textth:"prirodna šuma" textth:"dabiskais mežs" textth:"floresta natural"
textth:naturskov textth:"forêt naturelle" textth:naturskog
textth:"přírodní les" textth:luonnonmetsä textth:"pădure naturală"
textth:"las naturalny" textth:"natürlicher wald"</str>
...

As you can see, my search for "naturwald" expands to both single-word and
multiword synonyms, e.g. "forêt naturelle".
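
The shape of that parsed query can be reproduced with a small sketch (Python).
The helper expand_query and the in-memory group list are assumptions made for
illustration; only the field name textth is taken from the debug output.
Single-word entries become plain term clauses, and entries containing a blank
become quoted phrase clauses.

```python
def expand_query(term: str, groups: list[list[str]], field: str = "textth") -> str:
    """Expand a single search term into term and phrase clauses.

    Hypothetical helper: looks the term up in the synonym groups and emits
    one clause per entry, quoting multiword entries as phrases.
    """
    wanted = term.lower()
    for group in groups:
        if wanted in (entry.lower() for entry in group):
            clauses = []
            for entry in group:
                if " " in entry:
                    clauses.append(f'{field}:"{entry}"')  # multiword: phrase
                else:
                    clauses.append(f"{field}:{entry}")    # single word: term
            return " ".join(clauses)
    # No synonym group matched: search for the term as typed.
    return f"{field}:{term}"


groups = [["naturwald", "natuurbos", "forêt naturelle"]]
print(expand_query("naturwald", groups))
```

A term without a synonym group falls through unchanged, so the expansion only
affects terms the thesaurus actually knows.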


My SynonymFilterFactory has the following settings:

org.apache.solr.analysis.SynonymFilterFactory
{tokenizerFactory=solr.KeywordTokenizerFactory,
synonyms=synonyms_eurovoc_desc_desc_ufall.txt, expand=true, format=solr,
ignoreCase=true, luceneMatchVersion=LUCENE_36}

But as I already mentioned, there is much more work to be done to get it
running than just using SynonymFilterFactory.
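
Part of that extra work comes from the pitfall described in the quoted message
of 15 May further down: the query parser splits an unquoted query at white
space, so each chunk is looked up on its own. A minimal Python sketch of the
effect, with a hypothetical in-memory synonym table standing in for the
analyzer (none of these names are Solr APIs):

```python
# Hypothetical synonym table keyed by the whole (possibly multiword) entry.
SYNONYMS = {
    "mairie": ["mairie", "hotel de ville"],
    "hotel de ville": ["mairie", "hotel de ville"],
}


def lookup(token: str) -> list[str]:
    """Return the synonyms of a token, or the token itself if unknown."""
    return SYNONYMS.get(token.lower(), [token])


def analyze_chunked(query: str) -> list[str]:
    """Mimic a parser that splits on white space before analysis."""
    result = []
    for chunk in query.split():
        result.extend(lookup(chunk))
    return result


def analyze_whole(query: str) -> list[str]:
    """Mimic analysis of the whole query string as one unit."""
    return lookup(query)


print(analyze_chunked("hotel de ville"))  # ['hotel', 'de', 'ville']
print(analyze_whole("hotel de ville"))    # ['mairie', 'hotel de ville']
print(analyze_chunked("mairie"))          # ['mairie', 'hotel de ville']
```

The chunked path never sees "hotel de ville" as one key, so the multiword
entry is not found, while the single word "mairie" expands in both paths.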

Regards
Bernd



On 23.05.2012 08:49, elisabeth benoit wrote:
> Hello Bernd,
> 
> Thanks for your advice.
> 
> I have one question: how did you manage to map one word to a multiword
> synonym?
> 
> I've tried (in synonyms.txt)
> 
> mairie, hotel de ville
> 
> mairie, hotel\ de\ ville
> 
> mairie => mairie, hotel de ville
> 
> mairie => mairie, hotel\ de\ ville
> 
> but nothing prevents mairie from matching with "hotel"...
> 
> The only way I found is to use
> tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms declaration
> in schema.xml, but then since "mairie" is not alone in my index field, it
> doesn't match.
> 
> 
> best regards,
> Elisabeth
> 
> 
> 
> 
> the only way I found, in schema.xml, is to use
> 
> 
> 
> 2012/5/15 Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
> 
>> Without reading the whole thread let me say that you should not trust
>> the solr admin analysis. It takes the whole multiword search and runs
>> it all together at once through each analyzer step (factory).
>> But this is not how the real system works. First pitfall, the query parser
>> is also splitting at white space (if not a phrase query). Due to this,
>> a multiword query is sent chunk after chunk through the analyzer and,
>> second pitfall, each chunk runs through the whole analyzer on its own.
>>
>> So if you are dealing with multiword synonyms you have the following
>> problems. Either you turn your query into a phrase so that the whole
>> phrase is analyzed at once and therefore looked up as multiword synonym
>> but phrase queries are not analyzed !!! OR you send your query chunk
>> by chunk through the analyzer but then they are not multiwords anymore
>> and are not found in your synonyms.txt.
>>
>> From my experience I can say that it requires some deep work to get it done
>> but it is possible. I have connected a thesaurus to solr which is doing
>> query time expansion (no need to reindex if the thesaurus changes).
>> The thesaurus holds synonyms and "used for terms" in 24 languages. So
>> it is also some kind of language translation. And naturally the thesaurus
>> translates from single term to multi term synonyms and vice versa.
>>
>> Regards,
>> Bernd
>>
>>
>> On 14.05.2012 13:54, elisabeth benoit wrote:
>>> Just for the record, I'd like to conclude this thread
>>>
>>> First, you were right, there was no behaviour difference between fq and q
>>> parameters.
>>>
>>> I realized that:
>>>
>>> 1) my synonym (hotel de ville) has a stopword in it (de) and since I used
>>> tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms declaration,
>>> there was no stopword removal in the indexed expression, so when requesting
>>> "hotel de ville", after stopwords removal in query, Solr was comparing
>>> "hotel de ville"
>>> with "hotel ville"
>>>
>>> but my queries never even got to that point since
>>>
>>> 2) I made a mistake using "mairie" alone in the admin interface when
>>> testing my schema. The real field was something like "collectivités
>>> territoriales mairie",
>>> so the synonym "hotel de ville" was not even applied, because of the
>>> tokenizerFactory="solr.KeywordTokenizerFactory" in my synonym definition
>>> not splitting field into words when parsing
>>>
>>> So my problem is not solved, and I'm considering solving it outside of Solr
>>> scope, unless someone else has a clue
>>>
>>> Thanks again,
>>> Elisabeth
>>>
>>>
>>>
>>> 2012/4/25 Erick Erickson <erickerick...@gmail.com>
>>>
>>>> A little farther down the debug info output you'll find something
>>>> like this (I specified fq=name:features)
>>>>
>>>> <arr name="parsed_filter_queries">
>>>> <str>name:features</str>
>>>> </arr>
>>>>
>>>>
>>>> so it may well give you some clue. But unless I'm reading things wrong, your
>>>> q is going against a field that has much more information than the
>>>> CATEGORY_ANALYZED field, is it possible that the data from your
>>>> test cases simply isn't _in_ CATEGORY_ANALYZED?
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Wed, Apr 25, 2012 at 9:39 AM, elisabeth benoit
>>>> <elisaelisael...@gmail.com> wrote:
>>>>> I'm not at the office until next Wednesday, and I don't have my Solr at
>>>>> hand, but isn't debugQuery=on giving information only about q parameter
>>>>> matching and nothing about fq parameter? Or do you mean
>>>>> "parsed_filter_queries" gives information about fq?
>>>>>
>>>>> CATEGORY_ANALYZED is being populated by a copyField instruction in
>>>>> schema.xml, and has the same field type as my catchall field, the search
>>>>> field for my searchHandler (the one being used by q parameter).
>>>>>
>>>>> CATEGORY (a string) is copied in CATEGORY_ANALYZED (field type is text)
>>>>>
>>>>> CATEGORY (a string) is copied in catchall field (field type is text), and a
>>>>> lot of other fields are copied too in that catchall field.
>>>>>
>>>>> So as far as I can see, the same analysis should be done in both cases, but
>>>>> obviously I'm missing something, and the only thing I can think of is a
>>>>> different behavior between q and fq parameter.
>>>>>
>>>>> I'll check that parsed_filter_queries first thing in the morning next
>>>>> Wednesday.
>>>>>
>>>>> Thanks a lot for your help.
>>>>>
>>>>> Elisabeth
>>>>>
>>>>>
>>>>> 2012/4/24 Erick Erickson <erickerick...@gmail.com>
>>>>>
>>>>>> Elisabeth:
>>>>>>
>>>>>> What shows up in the debug section of the response when you add
>>>>>> &debugQuery=on? There should be some bit of that section like:
>>>>>> "parsed_filter_queries"
>>>>>>
>>>>>> My other question is "are you absolutely sure that your
>>>>>> CATEGORY_ANALYZED field has the correct content?". How does it
>>>>>> get populated?
>>>>>>
>>>>>> Nothing jumps out at me here....
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>> On Tue, Apr 24, 2012 at 9:55 AM, elisabeth benoit
>>>>>> <elisaelisael...@gmail.com> wrote:
>>>>>>> yes, thanks, but this is NOT my question.
>>>>>>>
>>>>>>> I was wondering why I have multiple matches with q="hotel de ville" and no
>>>>>>> match with fq=CATEGORY_ANALYZED:"hotel de ville", since in both cases I'm
>>>>>>> searching in the same solr fieldType.
>>>>>>>
>>>>>>> Why is q parameter behaving differently in that case? Why do the quotes
>>>>>>> work in one case and not in the other?
>>>>>>>
>>>>>>> Does anyone know?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Elisabeth
>>>>>>>
>>>>>>> 2012/4/24 Jeevanandam <je...@myjeeva.com>
>>>>>>>
>>>>>>>>
>>>>>>>> usage of q and fq
>>>>>>>>
>>>>>>>> q => is typically the main query for the search request
>>>>>>>>
>>>>>>>> fq => is Filter Query; generally used to restrict the super set of
>>>>>>>> documents without influencing score (more info:
>>>>>>>> http://wiki.apache.org/solr/CommonQueryParameters#q)
>>>>>>>>
>>>>>>>> For example:
>>>>>>>> ------------
>>>>>>>> q="hotel de ville" ===> returns 100 documents
>>>>>>>>
>>>>>>>> q="hotel de ville"&fq=price:[100 TO *]&fq=roomType:"King size Bed" ===>
>>>>>>>> returns 40 documents from super set of 100 documents
>>>>>>>>
>>>>>>>>
>>>>>>>> hope this helps!
>>>>>>>>
>>>>>>>> - Jeevanandam
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 24-04-2012 3:08 pm, elisabeth benoit wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'd like to resume this post.
>>>>>>>>>
>>>>>>>>> The only way I found to avoid splitting synonyms into words in
>>>>>>>>> synonyms.txt is to use the line
>>>>>>>>>
>>>>>>>>>  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>>>> ignoreCase="true" expand="true"
>>>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/>
>>>>>>>>>
>>>>>>>>> in schema.xml
>>>>>>>>>
>>>>>>>>> where tokenizerFactory="solr.KeywordTokenizerFactory"
>>>>>>>>>
>>>>>>>>> instructs SynonymFilterFactory not to break synonyms into words on white
>>>>>>>>> spaces when parsing synonyms file.
>>>>>>>>>
>>>>>>>>> So now it works fine, "mairie" is mapped into "hotel de ville" and when I
>>>>>>>>> send request q="hotel de ville" (quotes are mandatory to prevent the
>>>>>>>>> analyzer from splitting hotel de ville on white spaces), I get answers
>>>>>>>>> with word "mairie".
>>>>>>>>>
>>>>>>>>> But when I use fq parameter (fq=CATEGORY_ANALYZED:"hotel de ville"), it
>>>>>>>>> doesn't work!!!
>>>>>>>>>
>>>>>>>>> CATEGORY_ANALYZED is same field type as default search field. This means
>>>>>>>>> that when I send q="hotel de ville" and fq=CATEGORY_ANALYZED:"hotel de
>>>>>>>>> ville", solr uses the same analyzer, the one with the line
>>>>>>>>>
>>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>>>> ignoreCase="true" expand="true"
>>>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/>.
>>>>>>>>>
>>>>>>>>> Does anyone have a clue what is different between q analysis behaviour
>>>>>>>>> and fq analysis behaviour?
>>>>>>>>>
>>>>>>>>> Thanks a lot
>>>>>>>>> Elisabeth
>>>>>>>>>
>>>>>>>>> 2012/4/12 elisabeth benoit <elisaelisael...@gmail.com>
>>>>>>>>>
>>>>>>>>>  oh, that's right.
>>>>>>>>>>
>>>>>>>>>> thanks a lot,
>>>>>>>>>> Elisabeth
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2012/4/11 Jeevanandam Madanagopal <je...@myjeeva.com>
>>>>>>>>>>
>>>>>>>>>>  Elisabeth -
>>>>>>>>>>>
>>>>>>>>>>> As you described, the mapping below might suit your need.
>>>>>>>>>>> mairie => hotel de ville, mairie
>>>>>>>>>>>
>>>>>>>>>>> mairie gets expanded to "hotel de ville" and "mairie" at index time. So
>>>>>>>>>>> "mairie" and "hotel de ville" are both searchable on the document.
>>>>>>>>>>>
>>>>>>>>>>> However, the white space tokenizer splitting at query time will still
>>>>>>>>>>> be a problem, as described by Markus.
>>>>>>>>>>>
>>>>>>>>>>> --Jeevanandam
>>>>>>>>>>>
>>>>>>>>>>> On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:
>>>>>>>>>>>
>>>>>>>>>>>> <<Have you tried the "=>" mapping instead? Something
>>>>>>>>>>>> <<like
>>>>>>>>>>>> <<hotel de ville => mairie
>>>>>>>>>>>> <<might work for you.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, thanks, I've tried it but from what I understand it doesn't solve my
>>>>>>>>>>>> problem, since this means hotel de ville will be replaced by mairie at
>>>>>>>>>>>> index time (I use synonyms only at index time). So when user will ask
>>>>>>>>>>>> "hôtel de ville", it won't match.
>>>>>>>>>>>>
>>>>>>>>>>>> In fact, at index time I have mairie in my data, but I want user to be able
>>>>>>>>>>>> to request "mairie" or "hôtel de ville" and have mairie as answer, and not
>>>>>>>>>>>> have mairie as an answer when requesting "hôtel".
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> <<To map `mairie` to `hotel de ville` as single token you must escape your
>>>>>>>>>>>> <<white space.
>>>>>>>>>>>>
>>>>>>>>>>>> <<mairie, hotel\ de\ ville
>>>>>>>>>>>>
>>>>>>>>>>>> <<This results in a problem if your tokenizer splits on white space at
>>>>>>>>>>>> <<query time.
>>>>>>>>>>>>
>>>>>>>>>>>> Ok, I guess this means I have a problem. No simple solution since at query
>>>>>>>>>>>> time my tokenizer does split on white spaces.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess my problem is more or less one of the problems discussed in
>>>>>>>>>>>>
>>>>>>>>>>>> http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for your answers,
>>>>>>>>>>>> Elisabeth
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/4/10 Erick Erickson <erickerick...@gmail.com>
>>>>>>>>>>>>
>>>>>>>>>>>>> Have you tried the "=>" mapping instead? Something
>>>>>>>>>>>>> like
>>>>>>>>>>>>> hotel de ville => mairie
>>>>>>>>>>>>> might work for you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best
>>>>>>>>>>>>> Erick
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
>>>>>>>>>>>>> <elisaelisael...@gmail.com> wrote:
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've read several posts on this issue, but can't find a real solution to my
>>>>>>>>>>>>>> multi-words synonyms matching problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have in my synonyms.txt an entry like
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mairie, hotel de ville
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and my index time analyzer is configured as follows for synonyms.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>>>>>>>>> ignoreCase="true" expand="true"/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem I have is that now "mairie" matches with "hotel" and I would
>>>>>>>>>>>>>> only want "mairie" to match with "hotel de ville" and "mairie".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I look into the analyzer, I see that "mairie" is mapped into "hotel",
>>>>>>>>>>>>>> and words "de ville" are added in second and third position. To change
>>>>>>>>>>>>>> that, I tried to do
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>>>>>>>>> ignoreCase="true" expand="true"
>>>>>>>>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/> (as I read in one post)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and I can see now in the analyzer that "mairie" is mapped to "hotel de
>>>>>>>>>>>>>> ville", but now when I have query "hotel de ville", it doesn't match at all
>>>>>>>>>>>>>> with "mairie".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does anyone have a clue of what I'm doing wrong?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm using Solr 3.4.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Elisabeth
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>>
>> --
>> *************************************************************
>> Bernd Fehling                Universitätsbibliothek Bielefeld
>> Dipl.-Inform. (FH)                        Universitätsstr. 25
>> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
>> bernd.fehl...@uni-bielefeld.de                33615 Bielefeld
>>
>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>> *************************************************************
>>
> 

-- 
*************************************************************
Bernd Fehling                Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                        Universitätsstr. 25
Tel. +49 521 106-4060                   Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de                33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
