bq: Is there a way I can make "m&m" index as one string AND also keep
StandardTokenizerFactory since I need it for other searches?

In a word, no. You get one and only one tokenizer per field. But there
are lots of options:
> Use a different tokenizer, possibly one of the regex-based ones
   (PatternTokenizerFactory, for instance).
> Fake it with phrase queries.
> Take a really good look at the various filter combinations. It's
   possible that WhitespaceTokenizerFactory plus WordDelimiterFilterFactory
   might be able to do good things (see the sketch below).
> Clearly define whether this is a capability you really need.

This last is my recurring plea to ensure that the effort is of real benefit
to the user, and not just something someone noticed that's actually
only useful 0.001% of the time.
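
Something like this is roughly what I mean by the whitespace + WDF option
(just a sketch, untested; the "text_keep_amp" field type name is made up).
WordDelimiterFilterFactory with preserveOriginal="1" keeps the literal "m&m"
token, and catenateWords="1" also emits "mm", so "m&m" survives as a single
indexed term while ordinary text still splits on whitespace:

    <fieldType name="text_keep_amp" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1"
                catenateAll="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

You could hang that on a copyField and keep the existing StandardTokenizer
field for everything else. The phrase-query route is the cheap alternative:
a document containing the literal text "m&m" gets indexed by StandardTokenizer
as two adjacent "m" tokens, so the phrase query "m m" would match it.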

Best
Erick


On Tue, Aug 27, 2013 at 5:00 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> Yup, the query "o'reilly" worked after adding WDF to the index analyser.
>
>
> However, "m&m" or "m\&m" doesn't work.
> Field analysis for "m&m" says:
>
> Index:
> ST  m, m
> WDF m, m
>
> Query:
> ST  m, m
> WDF m, m
>
> So essentially "&" is ignored at both index and query time. My guess is
> that the standard tokenizer is the problem. As the documentation says:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
> Example: "I.B.M. 8.5 can't!!!" ==> ALPHANUM: "I.B.M.", NUM:"8.5",
> ALPHANUM:"can't"
>
> The char "&" will be ignored I guess.
>
> *So, my question is:*
> Is there a way I can make "m&m" index as one string AND also keep
> StandardTokenizerFactory since I need it for other searches?
>
> Thanks,
> -Utkarsh
>
>
> On Tue, Aug 27, 2013 at 11:44 AM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
> > Thanks for the info.
> >
> > 1.
> >
> > http://SERVER/solr/prodinfo/select?q=o%27reilly&wt=json&indent=true&debugQuery=true
> > returns:
> >
> > {
> >   "responseHeader":{
> >     "status":0,
> >     "QTime":16,
> >     "params":{
> >       "debugQuery":"true",
> >       "indent":"true",
> >       "q":"o'reilly",
> >       "wt":"json"}},
> >   "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
> >   },
> >   "debug":{
> >     "rawquerystring":"o'reilly",
> >     "querystring":"o'reilly",
> >     "parsedquery":"MultiPhraseQuery(allText:\"o'reilly (reilly
> oreilly)\")",
> >     "parsedquery_toString":"allText:\"o'reilly (reilly oreilly)\"",
> >     "QParser":"LuceneQParser",
> >     "explain":{}
> >    }
> > }
> >
> >
> >
> > 2. Analysis gives this: http://i.imgur.com/IPEiiEQ.png I assume this
> > means the tokens are the same for "o'reilly".
> > 3. I tried escaping ', it doesn’t help:
> > http://SERVER/solr/prodinfo/select?q=o\%27reilly&wt=json&indent=true
> >
> > I will add WordDelimiterFilterFactory for index and see if it fixes the
> > problem.
> >
> > Thanks,
> > -Utkarsh
> >
> >
> >
> > On Mon, Aug 26, 2013 at 3:15 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> First thing to do is attach &debugQuery=true to your queries and look
> >> at the parsed output.
> >>
> >> Second thing to do is look at the admin/analysis page and see what
> >> happens at index and query time to things like o'reilly. You have
> >> WordDelimiterFilterFactory configured in your query analysis chain but
> >> not in your index chain. My bet is that you're getting different tokens
> >> at query and index time...
> >>
> >> Third thing is that you need to escape the & character. It's probably
> >> being interpreted as a delimiter on the URL and Solr ignores params it
> >> doesn't understand.
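> >>
> >> For example, the ampersand can be sent URL-encoded as %26 (reusing the
> >> SERVER/prodinfo placeholders from elsewhere in this thread):
> >>
> >>     http://SERVER/solr/prodinfo/select?q=m%26m&wt=json&debugQuery=true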
> >>
> >> Best
> >> Erick
> >>
> >>
> >> On Mon, Aug 26, 2013 at 5:08 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
> >>
> >> > Some of the queries (not all) with special chars return no documents.
> >> >
> >> > Example: queries returning no documents
> >> > q=m&m (this can be explained: when I search for "m m", no documents
> >> > are returned)
> >> > q=o'reilly (when I search for "o reilly", I get documents back)
> >> >
> >> >
> >> > Queries returning documents:
> >> > q=hello&world (document matched is "Hello World: A Life in Ham Radio")
> >> >
> >> >
> >> > My questions are:
> >> > 1. What's wrong with "o'reilly"? What changes do I need in my
> >> > field type?
> >> > 2. How can I make the query "m&m" work?
> >> > My index has a bunch of M&M's docs like: "M & M's Milk Chocolate
> >> > Candy Coated Peanuts 19.2 oz" and "M and Ms Chocolate Candies -
> >> > Peanut - 1 Bag (42 oz)"
> >> >
> >> >
> >> > Field type:
> >> > <fieldType name="text_general" class="solr.TextField"
> >> >            positionIncrementGap="100">
> >> >   <analyzer type="index">
> >> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> >> >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >> >     <filter class="solr.ASCIIFoldingFilterFactory"/>
> >> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >> >   </analyzer>
> >> >   <analyzer type="query">
> >> >     <filter class="solr.WordDelimiterFilterFactory"
> >> >             generateWordParts="1" generateNumberParts="1"
> >> >             catenateWords="1" catenateNumbers="1"
> >> >             catenateAll="0" preserveOriginal="1"/>
> >> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> >> >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >> >     <filter class="solr.ASCIIFoldingFilterFactory"/>
> >> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >> >   </analyzer>
> >> > </fieldType>
> >> >
> >> >
> >> > --
> >> > Thanks,
> >> > -Utkarsh
> >> >
> >>
> >
> >
> >
> > --
> > Thanks,
> > -Utkarsh
> >
>
>
>
> --
> Thanks,
> -Utkarsh
>
