Alex - the query parsers generally accept an analyzer, which they must
apply after they perform their own tokenization. Consider: how would a
capitalized query term match lower-cased terms in the index without
query analysis?
-Mike
On 5/17/2014 4:05 AM, Alexandre Rafalovitch wrote:
Hello,
I am getting weird results that seem to come from eDisMax using
analyzer chain to break the input text. I have
WordDelimiterFilterFactory in my chain, which does a lot of
interesting things I did not expect query parser to be involved in.
Specifically, the string "abc123XYZ" gets split into 3 components on
digits and gets lowercased as well. I thought all that was happening
later, inside individual fields.
All documentation talks about query parsers splitting on space, so I
don't know where this "full chain" business is coming from. Or maybe I
am misunderstanding which phase debug output is from.
Here is the field definition:
<fieldType name="wdText" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
preserveOriginal="1" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="wsText" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<field name="wdText" type="wdText" indexed="true" stored="true" />
<field name="wsText" type="wsText" indexed="true" stored="true" />
And here is the debug output:
http://localhost:9000/solr/collection1/select?q=hello+big+world+abc123XYZ&wt=json&indent=true&debugQuery=true&defType=edismax&qf=wdText+wsText&stopwords=true&lowercaseOperators=true
"rawquerystring":"hello big world abc123XYZ",
"querystring":"hello big world abc123XYZ",
"parsedquery":"(+(DisjunctionMaxQuery((wdText:hello |
wsText:hello)) DisjunctionMaxQuery((wdText:big | wsText:big))
DisjunctionMaxQuery((wdText:world | wsText:world))
DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123
wdText:xyz) | wsText:abc123XYZ))))/no_coord",
"parsedquery_toString":"+((wdText:hello | wsText:hello)
(wdText:big | wsText:big) (wdText:world | wsText:world)
(((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz) |
wsText:abc123XYZ))",
Or, and enabling phrase search on the field type, gets even more
weird. But one problem at a time.
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency