Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind Tue, 02 Sep 2014 10:08:47 -0700

Thanks for the response.

I understand the problem a little bit better after investigating more.

Posting my full field definitions is, I think, going to be confusing, as they are long and complicated. I can narrow it down to an isolated case if I need to. The indexed field in question holds relatively short strings.

But what it has to do with is the WordDelimiterFilter's default splitOnCaseChange=1 and generateWordParts=1, and their effects.

Let's take a less confusing example: the query "MacBook", with a WordDelimiterFilter followed by something that downcases everything.

I think what the WDF (followed by case folding) is trying to do is make the query "MacBook" match both indexed text "mac book" and "macbook" -- either one should be a match. Is my understanding right of what WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is intending to do?
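
To make that expectation concrete, here is a minimal Python sketch of the idea. It is only a toy model, not Lucene's WordDelimiterFilter code: the function names are made up for illustration, and keeping the whole token alongside the parts is an assumption based on the token output reported later in this thread.

```python
import re

def split_on_case_change(token):
    """Split at each lower-to-upper transition (toy stand-in for splitOnCaseChange=1)."""
    # Insert a space wherever a lowercase letter is directly followed by an uppercase one.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()

def query_terms(token):
    """Lowercased word parts, plus the lowercased whole token if it differs.

    Assumption: the whole token is kept next to the parts, as in the
    token output shown in this thread. Which filter option produces
    that is not stated here.
    """
    terms = [part.lower() for part in split_on_case_change(token)]
    whole = token.lower()
    if whole not in terms:
        terms.append(whole)
    return terms

if __name__ == "__main__":
    print(query_terms("MacBook"))
    print(query_terms("dELALAIN"))
```

Under this model the query side ends up with "mac", "book" and "macbook", which is why one would expect either indexed form to match.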

In my actual index, query "MacBook" is matching ONLY "mac book", and not "macbook", which is unexpected. I do want it to match both. (I realize I could make it match only "macbook" by setting splitOnCaseChange=0 and/or generateWordParts=0.)

It's possible this is happening as a side effect of other parts of my complex field definition, and I really do need to post the whole thing and/or isolate it. But I wonder if there are known general problem cases that cause this kind of failure, or any known bugs in WordDelimiterFilter (in Solr 4.3?) that cause this kind of failure.

And I wonder if the WordDelimiterFilter emitting the token "MacBook" with position "2" rather than "1" is expected, irrelevant, or possibly a relevant problem.
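
One way the position could matter, sketched as a toy model in Python. This assumes the query ends up being evaluated positionally, like a phrase, with "mac" at position 1 and both "book" and "macbook" at position 2. Whether Solr 4.3 actually builds such a query depends on field and parser settings not shown in this thread, so treat this as a hypothesis, not a diagnosis. `phrase_matches` and `QUERY_SLOTS` are hypothetical names.

```python
def phrase_matches(query_slots, indexed_tokens):
    """True if some run of consecutive indexed tokens fills every query slot in order.

    query_slots: list of sets, one set of acceptable terms per query position.
    indexed_tokens: list of terms in the indexed text, in order.
    """
    n = len(query_slots)
    for start in range(len(indexed_tokens) - n + 1):
        if all(indexed_tokens[start + j] in query_slots[j] for j in range(n)):
            return True
    return False

# Assumed layout: position 1 holds "mac"; position 2 holds "book" and "macbook".
QUERY_SLOTS = [{"mac"}, {"book", "macbook"}]

if __name__ == "__main__":
    print(phrase_matches(QUERY_SLOTS, ["mac", "book"]))  # two indexed tokens
    print(phrase_matches(QUERY_SLOTS, ["macbook"]))      # one indexed token
```

In this model a two-slot query can never be satisfied by a single indexed token, which would be consistent with "mac book" matching while "macbook" does not.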


Thanks again,

Jonathan

On 9/2/14 12:59 PM, Michael Della Bitta wrote:

Hi Jonathan,

Little confused by this line:

And, what I think it's trying to do, is match text indexed as "d elalain"
as well as text indexed by "delalain".

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta


On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind wrote:

Hello, I'm running into a case where a query is not returning the results
I expect, and I'm hoping someone can offer some explanation that might help
me fine tune things or understand what's up.

I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later, a filter that
downcases everything for case-insensitive searching. It includes many other
things too, but I think these are the pertinent facts.

For query "dELALAIN", the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with "d" and
"ELALAIN" split into two tokens, and another with just one token.

Later, all the tokens are lowercased by another filter in the chain.
(actually an ICU filter which is doing something more complicated than just
lowercasing, but I think we can consider it lowercasing for the purposes of
this discussion).
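
To see the overlap more clearly, here is a small Python sketch that takes the three tokens listed above, lowercases them, and groups them by position. The token values are copied from this message. The helper name `group_by_position` is made up for illustration, and plain `str.lower()` stands in for the ICU filter.

```python
from collections import defaultdict

# (text, start offset, position) exactly as reported above.
TOKENS = [
    ("d", 0, 1),
    ("ELALAIN", 1, 2),
    ("dELALAIN", 0, 2),
]

def group_by_position(tokens):
    """Map each position to the lowercased token texts that occupy it."""
    slots = defaultdict(list)
    for text, _start, position in tokens:
        slots[position].append(text.lower())
    return dict(sorted(slots.items()))

if __name__ == "__main__":
    for position, texts in group_by_position(TOKENS).items():
        print(position, texts)
```

Grouped this way, position 1 holds only "d", while position 2 holds both "elalain" and "delalain".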

If I understand right what the WordDelimiterFilter is trying to do here,
it's probably doing something special because of the lowercase "d" followed
by an uppercase letter, a special case for that. (I don't get this behavior
with other mixed case queries not beginning with 'd').

And, what I think it's trying to do, is match text indexed as "d elalain"
as well as text indexed by "delalain".

The problem is, it's not accomplishing that -- it is NOT matching text
that was indexed as "delalain" (one token).

I don't entirely understand what the "position" attribute is for -- but I
wonder if in this case, the position on "dELALAIN" is really supposed to be
1, not 2?  Could that be responsible for the bug?  Or is position
irrelevant in this case?

If that's not it, then I'm at a loss as to what may be causing this bug --
or even if it's a bug at all, or I'm just not understanding intended
behavior. I expect a query for "dELALAIN" to match text indexed as
"delalain" (because of the forced lowercasing in the filter chain). But
it's not doing so. Are my expectations wrong? Bug? Something else?

Thanks for any advice,

Jonathan
