RE: Searching words with spaces for word without spaces in solr

Dyer, James Thu, 31 Jul 2014 09:03:29 -0700

If a user is searching on "ice cream" but your index has "icecream", you can 
treat this like a spelling error.  WordBreakSolrSpellChecker would identify the 
fact that while "ice cream" is not in your index, "icecream" is, and then you 
can re-query for the corrected version without the space.
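
As a rough sketch of what that could look like in solrconfig.xml: the field 
name "title" is taken from the query quoted below, while the dictionary name 
"wordbreak", the maxChanges value, and the handler wiring are illustrative 
assumptions rather than anything from the poster's setup.

```xml
<!-- Sketch only: field name, dictionary name, and handler wiring are assumptions. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <!-- Field whose indexed terms are used to check combined/broken words -->
    <str name="field">title</str>
    <!-- combineWords lets "ice cream" be suggested as "icecream" -->
    <str name="combineWords">true</str>
    <!-- breakWords lets "icecream" be suggested as "ice cream" -->
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <!-- Collation returns a rewritten query the client can re-issue -->
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

With something along these lines, a query such as q=title:ice+cream should 
come back with a suggestion (and a collation) for "icecream" when that term 
exists in the field, and the client can then re-query with it.  A lightly 
analyzed copy of the field (no stemming or shingles) is usually a better 
source for the spellchecker than a heavily analyzed one.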


The problem with solving this with analyzers is that you can analyze 
"ice-cream" as either "ice cream" or "icecream" (split or catenate on hyphen).  
You can even analyze "IceCream" as "Ice Cream" (split on case change).  But how 
is your analyzer going to know that "icecream" should index as two tokens: 
"ice" "cream"?  You're asking analysis to do too much in this case.  This is 
where spellcheck can bridge the gap.
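
To make the analysis side concrete, here is a minimal sketch of a field type 
that both splits and catenates.  The name "text_wdf_sketch" is hypothetical, 
and a whitespace tokenizer is assumed so that the hyphen is still present when 
WordDelimiterFilterFactory sees the token (StandardTokenizerFactory would 
already have split "ice-cream" on the hyphen).

```xml
<!-- Sketch only: the field type name is hypothetical. -->
<fieldType name="text_wdf_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keeps "ice-cream" and "IceCream" as single tokens for the next filter -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- generateWordParts: emit the parts ("ice", "cream")
         catenateWords:     also emit the joined form ("icecream")
         splitOnCaseChange: treat "IceCream" as "Ice" + "Cream" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this, "ice-cream" and "IceCream" each produce "ice", "cream", and 
"icecream", but a plain "icecream" has no delimiter or case change to split 
on, so it stays a single token.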

Of course, if you have a discrete list of words you want split like this, then 
you can do it with analysis using index-time synonyms.  In this case, you need 
to provide it with the list.  See 
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory 
for more information.
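
A minimal sketch of the synonym approach, assuming a hypothetical file named 
compound_synonyms.txt that holds the list of compounds you want split (the 
file name and its entries are illustrative only):

```xml
<!--
  Hypothetical compound_synonyms.txt, one rule per line:

    icecream, ice cream
    handbag, hand bag
-->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- expand="true" indexes every equivalent form listed on a rule line -->
  <filter class="solr.SynonymFilterFactory"
          synonyms="compound_synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>
```

With expand="true", a document containing "icecream" is also indexed with the 
tokens "ice" and "cream", so a query for "ice cream" can match it.  If the 
synonym filter runs after a stemmer, as in the configuration quoted below, the 
entries in the file would need to be in their stemmed form.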

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: sunshine glass [mailto:sunshineglassof2...@gmail.com] 
Sent: Thursday, July 31, 2014 10:32 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching words with spaces for word without spaces in solr

I am not clear on this. This link is about spell checking. Can you
elaborate?


On Wed, Jul 30, 2014 at 9:17 PM, Dyer, James <james.d...@ingramcontent.com>
wrote:

> In addition to the analyzer configuration you're using, you might want to
> also use WordBreakSolrSpellChecker to catch possible matches that can't
> easily be solved through analysis.  For more information, see the section
> for it at https://cwiki.apache.org/confluence/display/solr/Spell+Checking
>
> James Dyer
> Ingram Content Group
> (615) 213-4311
>
> -----Original Message-----
> From: sunshine glass [mailto:sunshineglassof2...@gmail.com]
> Sent: Wednesday, July 30, 2014 9:38 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Searching words with spaces for word without spaces in solr
>
> This is the new configuration:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>             outputUnigrams="true" tokenSeparator=""/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory"
>             language="English" protected="protwords.txt"/>
>     <filter class="solr.SynonymFilterFactory"
>             synonyms="stemmed_synonyms_text_prime_index.txt" ignoreCase="true"
>             expand="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords_text_prime_search.txt" enablePositionIncrements="true"/>
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>             outputUnigrams="true" tokenSeparator=""/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>     <filter class="solr.SnowballPorterFilterFactory"
>             language="English" protected="protwords.txt"/>
>   </analyzer>
> </fieldType>
>
> These are current docs in my index:
>
> <result name="response" numFound="3" start="0">
> <doc>
> <str name="id">2</str>
> <str name="title">Icecream</str>
> <long name="_version_">1475063961342705664</long>
> </doc>
> <doc>
> <str name="id">3</str>
> <str name="title">Ice-cream</str>
> <long name="_version_">1475063961344802816</long>
> </doc>
> <doc>
> <str name="id">1</str>
> <str name="title">Ice Cream</str>
> <long name="_version_">1475063961203245056</long>
> </doc>
> </result>
> </response>
>
> Query:
> http://localhost:8983/solr/collection1/select?q=title:ice+cream&debug=true
>
> Response:
>
> <result name="response" numFound="2" start="0">
> <doc>
> <str name="id">1</str>
> <str name="title">Ice Cream</str>
> <long name="_version_">1475063961203245056</long>
> </doc>
> <doc>
> <str name="id">3</str>
> <str name="title">Ice-cream</str>
> <long name="_version_">1475063961344802816</long>
> </doc>
> </result>
> <lst name="debug">
> <str name="rawquerystring">title:ice cream</str>
> <str name="querystring">title:ice cream</str>
> <str name="parsedquery">
> (+(title:ice DisjunctionMaxQuery((title:cream))))/no_coord
> </str>
> <str name="parsedquery_toString">+(title:ice (title:cream))</str>
> <lst name="explain">
> <str name="1">
> 0.875 = (MATCH) sum of:
>   0.4375 = (MATCH) weight(title:ice in 0) [DefaultSimilarity], result of:
>     0.4375 = score(doc=0,freq=2.0 = termFreq=2.0 ), product of:
>       0.70710677 = queryWeight, product of:
>         1.0 = idf(docFreq=2, maxDocs=3)
>         0.70710677 = queryNorm
>       0.61871845 = fieldWeight in 0, product of:
>         1.4142135 = tf(freq=2.0), with freq of:
>           2.0 = termFreq=2.0
>         1.0 = idf(docFreq=2, maxDocs=3)
>         0.4375 = fieldNorm(doc=0)
>   0.4375 = (MATCH) weight(title:cream in 0) [DefaultSimilarity], result of:
>     0.4375 = score(doc=0,freq=2.0 = termFreq=2.0 ), product of:
>       0.70710677 = queryWeight, product of:
>         1.0 = idf(docFreq=2, maxDocs=3)
>         0.70710677 = queryNorm
>       0.61871845 = fieldWeight in 0, product of:
>         1.4142135 = tf(freq=2.0), with freq of:
>           2.0 = termFreq=2.0
>         1.0 = idf(docFreq=2, maxDocs=3)
>         0.4375 = fieldNorm(doc=0)
> </str>
> <str name="3">
> 0.70710677 = (MATCH) sum of:
>   0.35355338 = (MATCH) weight(title:ice in 2) [DefaultSimilarity], result of:
>     0.35355338 = score(doc=2,freq=1.0 = termFreq=1.0 ), product of:
>       0.70710677 = queryWeight, product of:
>         1.0 = idf(docFreq=2, maxDocs=3)
>         0.70710677 = queryNorm
>       0.5 = fieldWeight in 2, product of:
>         1.0 = tf(freq=1.0), with freq of:
>           1.0 = termFreq=1.0
>         1.0 = idf(docFreq=2, maxDocs=3)
>         0.5 = fieldNorm(doc=2)
>   0.35355338 = (MATCH) weight(title:cream in 2) [DefaultSimilarity], result of:
>     0.35355338 = score(doc=2,freq=1.0 = termFreq=1.0 ), product of:
>       0.70710677 = queryWeight, product of:
>         1.0 = idf(docFreq=2, maxDocs=3)
>         0.70710677 = queryNorm
>       0.5 = fieldWeight in 2, product of:
>         1.0 = tf(freq=1.0), with freq of:
>           1.0 = termFreq=1.0
>         1.0 = idf(docFreq=2, maxDocs=3)
>         0.5 = fieldNorm(doc=2)
> </str>
> </lst>
>
> Still not working ????
>
>
> On Fri, May 30, 2014 at 9:21 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > I'd spend some time with the admin/analysis page to understand the exact
> > tokenization going on here. For instance, sequencing the
> > ShingleFilterFactory before WordDelimiterFilterFactory may produce
> > "interesting" results. And then throwing the Snowball factory at it and
> > putting synonyms in front.... I suspect you're not indexing or searching
> > what you think you are.
> >
> > Second, what happens when you query with &debug=query? That'll show you
> > what the search string looks like.
> >
> > If that doesn't help, please post the results of looking at those things
> > here, that'll provide some information for us to work with.
> >
> > Best,
> > Erick
> >
> >
> > On Fri, May 30, 2014 at 3:32 AM, sunshine glass <
> > sunshineglassof2...@gmail.com> wrote:
> >
> > > Hi Folks,
> > >
> > > Any updates ??
> > >
> > >
> > > On Wed, May 28, 2014 at 12:13 PM, sunshine glass <
> > > sunshineglassof2...@gmail.com> wrote:
> > >
> > > > Dear Team,
> > > >
> > > > > How can I handle compound word searches in Solr?
> > > > > How can I search for "hand bag" if I have "handbag" in my index? While
> > > > > using a shingle filter in the query analyzer, the query "ice cube"
> > > > > creates three tokens: "ice", "cube", and "icecube". Only "ice" and
> > > > > "cube" are searched, but not "icecube", i.e. it is not working for the
> > > > > pair even though I am using the shingle filter.
> > > >
> > > > Here's the schema config.
> > > >
> > > >
> > > > > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > > > >   <analyzer type="index">
> > > > >     <filter class="solr.SynonymFilterFactory"
> > > > >             synonyms="synonyms_text_prime_index.txt" ignoreCase="true"
> > > > >             expand="true"/>
> > > > >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > > >     <filter class="solr.ShingleFilterFactory"
> > > > >             maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
> > > > >     <filter class="solr.WordDelimiterFilterFactory"
> > > > >             catenateWords="1" catenateNumbers="1" catenateAll="1"
> > > > >             preserveOriginal="1"
> > > > >             generateWordParts="1" generateNumberParts="1"/>
> > > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > >     <filter class="solr.SnowballPorterFilterFactory"
> > > > >             language="English" protected="protwords.txt"/>
> > > > >   </analyzer>
> > > > >   <analyzer type="query">
> > > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > > >     <filter class="solr.SynonymFilterFactory"
> > > > >             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > > > >     <filter class="solr.ShingleFilterFactory"
> > > > >             maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
> > > > >     <filter class="solr.WordDelimiterFilterFactory"
> > > > >             preserveOriginal="1"/>
> > > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > >     <filter class="solr.SnowballPorterFilterFactory"
> > > > >             language="English" protected="protwords.txt"/>
> > > > >   </analyzer>
> > > > > </fieldType>
> > > >
> > > >    Any help is appreciated.
> > > >
> > > >
> > >
> >
>
