RE: Solr synonyms format query time vs index time

Steven A Rowe Tue, 17 Aug 2010 11:56:56 -0700

Hi Michael,

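I think the problem you're seeing is that no indexed document contains the
term "reebox". You've used the "explicit" syntax (source=>dest), which replaces
"reebox" with "Reebok" at index time, instead of the "equivalent" syntax
(term,term,term), which indexes both terms when expand=true.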


I'm guessing that if you convert your synonym file from:

        reebox => Reebok

to:

        reebox, Reebok

and leave expand=true, and then reindex, everything will work: your indexed 
documents containing "Reebok" will be made to include "reebox", so queries for 
"reebox" will produce hits on those documents.

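To make the difference concrete, here is a toy Python model of the two rule
styles. It is only a sketch, not Solr's implementation: parse_rules and analyze
are hypothetical helper names, it handles single-word terms only, and the final
lowercasing merely stands in for LowerCaseFilterFactory (stop words, word
delimiting and stemming are left out).

```python
def parse_rules(lines, expand=True, ignore_case=True):
    """Build a {source token: [output tokens]} map from synonyms.txt-style lines.

    Toy model: single-word terms only, no multi-word synonyms.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit rule: each source on the left is REPLACED by the right side.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalent rule: with expand=True every term maps to all terms;
            # with expand=False every term maps to the first term only.
            terms = [t.strip() for t in line.split(",") if t.strip()]
            sources = terms
            targets = terms if expand else terms[:1]
        for source in sources:
            key = source.lower() if ignore_case else source
            mapping[key] = targets
    return mapping


def analyze(text, mapping=None, ignore_case=True):
    """Whitespace-tokenize, optionally apply synonyms, then lowercase."""
    terms = set()
    for token in text.split():
        key = token.lower() if ignore_case else token
        outputs = mapping.get(key, [token]) if mapping else [token]
        terms.update(output.lower() for output in outputs)
    return terms


explicit = parse_rules(["reebox => Reebok"])
equivalent = parse_rules(["reebox, Reebok"])

doc = "Reebok running shoes"      # made-up document text
query_terms = analyze("reebox")   # query analyzer has no synonym filter

# Explicit rule: the document is indexed without "reebox", so no match.
explicit_hit = bool(analyze(doc, explicit) & query_terms)      # False

# Equivalent rule: "reebox" is indexed alongside "reebok", so the query matches.
equivalent_hit = bool(analyze(doc, equivalent) & query_terms)  # True
```

The real filter chain does more than this, so treat it as an illustration of
the rule semantics rather than a way to predict exact index contents.
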
Steve

> -----Original Message-----
> From: mtdowling [mailto:mtdowl...@gmail.com]
> Sent: Tuesday, August 17, 2010 2:24 PM
> To: solr-user@lucene.apache.org
> Subject: Solr synonyms format query time vs index time
> 
> 
> My company recently started using Solr for site search and autocomplete.
> It's working great, but we're running into a problem with synonyms.  We are
> generating a synonyms.txt file from a database table and using that
> synonyms.txt file at index time on a text type field.  Here's an excerpt
> from the synonyms file:
> 
> reebox => Reebok
> shinguards => Shin Guards
> shirt => T-Shirt,Shirt
> shmak => Shmack
> shocks => shox
> skateboard => Skate
> skateboarding => Skate
> skater => Skate
> skates => Skate
> skating => Skate
> skirt => Dresses
> 
> When we do a search for reebox, we want the term to be mapped to "Reebok"
> through explicit mapping, but for some reason this isn't happening.  We do
> have multi-word synonyms, and from what I've read on the mailing list, those
> only work at index time, so we are only using the synonym filter factory at
> index time:
> 
> <fieldType name="search" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>                 ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="0" generateNumberParts="0"
>                 catenateWords="1" catenateNumbers="1" catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="English"
>                 protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="0" generateNumberParts="0"
>                 catenateWords="1" catenateNumbers="1" catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="English"
>                 protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     </analyzer>
> </fieldType>
> 
> Here's more relevant schema.xml configs:
> 
> <field name="mashup" type="search" indexed="true" stored="false"
> multiValued="true"/>
> <copyField source="keywords" dest="mashup"/>
> <copyField source="category" dest="mashup"/>
> <copyField source="name" dest="mashup"/>
> <copyField source="brand" dest="mashup"/>
> <copyField source="description_overview" dest="mashup"/>
> <copyField source="sku" dest="mashup"/>
> <!-- other copy fields... -->
> 
> The output of the query analyzer shows the following:
> 
> Query Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> term position         1
> term text     reebox
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> ignoreCase=true}
> term position         1
> term text     reebox
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.WordDelimiterFilterFactory
> {generateNumberParts=0,
> catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
> term position         1
> term text     reebox
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> term position         1
> term text     reebox
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.SnowballPorterFilterFactory
> {protected=protwords.txt, language=English}
> term position         1
> term text     reebox
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> term position         1
> term text     reebox
> term type     word
> source start,end      0,6
> payload
> 
> So "reebox" is never being converted to "Reebok".  I thought that if I had
> index time synonyms with expansion configured that I wouldn't need query
> time synonyms.  Maybe my dynamic synonyms generation isn't formatted
> correctly for my desired result?
> 
> If I use the same synonyms.txt file and use the index analyzer, reebox is
> mapped to Reebok and then indexed correctly:
> 
> Index Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> term position         1
> term text     reebox
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
> expand=true, ignoreCase=true}
> term position         1
> term text     Reebok
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> ignoreCase=true}
> term position         1
> term text     Reebok
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.WordDelimiterFilterFactory
> {generateNumberParts=0,
> catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
> term position         1
> term text     Reebok
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> term position         1
> term text     reebok
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.SnowballPorterFilterFactory
> {protected=protwords.txt, language=English}
> term position         1
> term text     reebok
> term type     word
> source start,end      0,6
> payload
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> term position         1
> term text     reebok
> term type     word
> source start,end      0,6
> payload
> 
> 
> Should I use equivalent mapping instead of explicit mapping if I'm only
> using index-time synonyms?  Or should I turn query time synonyms on for my
> search field?
> 
> Thanks,
> Michael
