WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind Tue, 02 Sep 2014 09:42:41 -0700

Hello, I'm running into a case where a query is not returning the results I expect, and I'm hoping someone can offer some explanation that might help me fine-tune things or understand what's up.


I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later, a filter that downcases everything for case-insensitive searching. It includes many other things too, but I think these are the pertinent facts.


For query "dELALAIN", the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with "d" and "ELALAIN" split into two tokens, and another with just one token.

Later, all the tokens are lowercased by another filter in the chain (actually an ICU filter that does something more complicated than just lowercasing, but I think we can consider it lowercasing for the purposes of this discussion).
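
To make the token stream concrete, here is a minimal Python sketch (not Solr or Lucene code) that models the three tokens listed above and applies plain lowercasing as a stand-in for the later filter. The `Token` type and the `lowercase_all` helper are hypothetical names used only for illustration, and the real ICU filter does more than `str.lower()`.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Illustrative model of one analyzed token (hypothetical type)."""
    text: str
    start: int
    position: int


# Tokens as reported above for the query "dELALAIN".
tokens = [
    Token("d", 0, 1),
    Token("ELALAIN", 1, 2),
    Token("dELALAIN", 0, 2),
]


def lowercase_all(stream):
    """Stand-in for the later lowercasing filter: only the text changes."""
    return [t._replace(text=t.text.lower()) for t in stream]


lowered = lowercase_all(tokens)
for t in lowered:
    print(t.text, t.start, t.position)
```

Under this simplified model, lowercasing changes only the token text; the start offsets and positions assigned by the WordDelimiterFilter carry through unchanged.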

If I understand right what the WordDelimiterFilter is trying to do here, it's probably doing something special because of the lowercase "d" followed by an uppercase letter, a special case for that. (I don't get this behavior with other mixed case queries not beginning with 'd').

And what I think it's trying to do is match text indexed as "d elalain" (two tokens) as well as text indexed as "delalain" (one token).

The problem is, it's not accomplishing that -- it is NOT matching text that was indexed as "delalain" (one token).

I don't entirely understand what the "position" attribute is for -- but I wonder if in this case, the position on "dELALAIN" is really supposed to be 1, not 2? Could that be responsible for the bug? Or is position irrelevant in this case?
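
As a rough illustration of the position question, the sketch below groups the lowercased tokens by position, assuming the usual Lucene convention that tokens sharing a position occupy the same slot in the stream. It is plain Python, not Solr code; the `slots` helper is a hypothetical name, and the second call only models the "what if the position were 1" guess from the paragraph above.

```python
from collections import defaultdict

# (text, position) pairs after lowercasing, as reported for "dELALAIN".
observed_tokens = [("d", 1), ("elalain", 2), ("delalain", 2)]

# Hypothetical alternative: the unsplit token placed at position 1 instead.
hypothetical_tokens = [("d", 1), ("elalain", 2), ("delalain", 1)]


def slots(stream):
    """Group token texts by position, keeping the original token order."""
    by_position = defaultdict(list)
    for text, position in stream:
        by_position[position].append(text)
    return dict(sorted(by_position.items()))


observed = slots(observed_tokens)
hypothetical = slots(hypothetical_tokens)

print(observed)
print(hypothetical)
```

In the observed stream, "delalain" shares a slot with "elalain" and follows "d"; in the hypothetical one it shares a slot with "d" and is followed by "elalain". Whether either layout explains the missed match depends on how the query parser combines the slots, which this sketch does not model.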

If that's not it, then I'm at a loss as to what may be causing this bug -- or even if it's a bug at all, or I'm just not understanding intended behavior. I expect a query for "dELALAIN" to match text indexed as "delalain" (because of the forced lowercasing in the filter chain). But it's not doing so. Are my expectations wrong? Bug? Something else?


Thanks for any advice,

Jonathan
