Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?

Michael Sokolov Sat, 17 May 2014 06:43:25 -0700

Alex - the query parsers generally accept an analyzer, which they must apply after they perform their own tokenization. Consider: how would a capitalized query term match lower-cased terms in the index without query analysis?
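
To make the two steps concrete, here is a rough toy model in Python. It is not Solr's code: the function names and the ANALYZERS table are made up for illustration, and the word-delimiter stand-in only splits on letter/digit boundaries, which is far less than the real filter does. It just shows the parser splitting on whitespace first and then handing each chunk to every field's own analyzer chain.

```python
import re

def whitespace_tokenize(text):
    """Stand-in for WhitespaceTokenizerFactory: split on runs of whitespace."""
    return text.split()

def word_delimiter(tokens, preserve_original=True):
    """Crude stand-in for WordDelimiterFilterFactory (assumption: only
    letter/digit boundaries are handled; the real filter has more rules)."""
    out = []
    for tok in tokens:
        parts = re.findall(r"[A-Za-z]+|[0-9]+", tok)
        # Emit the original token first when it was actually split.
        if preserve_original and parts != [tok]:
            out.append(tok)
        out.extend(parts)
    return out

def lowercase(tokens):
    """Stand-in for LowerCaseFilterFactory."""
    return [t.lower() for t in tokens]

# Hypothetical per-field chains mirroring the two field types in this thread.
ANALYZERS = {
    "wdText": lambda s: lowercase(word_delimiter(whitespace_tokenize(s))),
    "wsText": whitespace_tokenize,
}

def parse(query, qf):
    """Step 1: the parser splits on whitespace.
    Step 2: each chunk goes through the analyzer of every field in qf."""
    return [{field: ANALYZERS[field](chunk) for field in qf}
            for chunk in query.split()]

for clause in parse("hello big world abc123XYZ", ["wdText", "wsText"]):
    print(clause)
```

In this model the last chunk yields abc123xyz, abc, 123 and xyz for wdText but stays abc123XYZ for wsText, which is the same shape as the DisjunctionMaxQuery clauses in the debug output quoted below.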


-Mike


On 5/17/2014 4:05 AM, Alexandre Rafalovitch wrote:

Hello,

I am getting weird results that seem to come from eDisMax using the
analyzer chain to break up the input text. I have
WordDelimiterFilterFactory in my chain, which does a lot of
interesting things I did not expect the query parser to be involved in.

Specifically, the string "abc123XYZ" gets split into 3 components on
digits and gets lowercased as well. I thought all that was happening
later, inside individual fields.

All the documentation talks about query parsers splitting on whitespace,
so I don't know where this "full chain" business is coming from. Or maybe
I am misunderstanding which phase the debug output is from.

Here is the field definition:
     <fieldType name="wdText" class="solr.TextField">
         <analyzer>
             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
             <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
             <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>
     </fieldType>
     <fieldType name="wsText" class="solr.TextField" positionIncrementGap="100">
         <analyzer>
             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         </analyzer>
     </fieldType>

     <field name="wdText"      type="wdText" indexed="true" stored="true" />
     <field name="wsText"      type="wsText" indexed="true" stored="true" />

And here is the debug output:
http://localhost:9000/solr/collection1/select?q=hello+big+world+abc123XYZ&wt=json&indent=true&debugQuery=true&defType=edismax&qf=wdText+wsText&stopwords=true&lowercaseOperators=true

     "rawquerystring":"hello big world abc123XYZ",
     "querystring":"hello big world abc123XYZ",
     "parsedquery":"(+(DisjunctionMaxQuery((wdText:hello | wsText:hello)) DisjunctionMaxQuery((wdText:big | wsText:big)) DisjunctionMaxQuery((wdText:world | wsText:world)) DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz) | wsText:abc123XYZ))))/no_coord",
     "parsedquery_toString":"+((wdText:hello | wsText:hello) (wdText:big | wsText:big) (wdText:world | wsText:world) (((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz) | wsText:abc123XYZ))",

Oh, and enabling phrase search on the field type gets even
weirder. But one problem at a time.

Regards,
    Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
