Thanks for the response.
I understand the problem a little bit better after investigating more.
Posting my full field definitions would, I think, be confusing, as
they are long and complicated. I can narrow it down to an isolated case
if I need to. The indexed field in question holds relatively short strings.
But the issue has to do with the WordDelimiterFilter's default
splitOnCaseChange=1 and generateWordParts=1, and their effects.
Let's take a less confusing example: the query "MacBook", with a
WordDelimiterFilter followed by something that downcases everything.
I think what the WDF (followed by case folding) is trying to do is make
query "MacBook" match both indexed text "mac book" as well as "macbook"
-- either one should be a match. Is that a correct understanding of what
WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is
intended to do?
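As a rough illustration of the behavior described above, here is a minimal Python sketch (not Solr code) that splits a token at each lowercase-to-uppercase transition and then lowercases the parts. The function names are hypothetical, and the sketch models only the word parts; it does not model catenated or preserved-original tokens, positions, or any other WordDelimiterFilter option.

```python
def split_on_case_change(token):
    """Split a token wherever a lowercase letter is followed by an uppercase one."""
    parts, start = [], 0
    for i in range(1, len(token)):
        if token[i - 1].islower() and token[i].isupper():
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts


def analyze(token):
    """Toy filter chain: case-change split, then lowercasing (stand-in for case folding)."""
    return [part.lower() for part in split_on_case_change(token)]


print(analyze("MacBook"))   # word parts only
print(analyze("macbook"))   # no case change, so a single token
```

In this model "MacBook" yields only the parts "mac" and "book", so matching indexed "macbook" as well would depend on some additional single-token form being produced elsewhere in the chain.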
In my actual index, query "MacBook" is matching ONLY "mac book", and not
"macbook", which is unexpected. I want it to match both. (I
realize I could make it match only 'macbook' by setting
splitOnCaseChange=0 and/or generateWordParts=0.)
It's possible this is happening as a side effect of other parts of my
complex field definition, and I really do need to post the whole thing
and/or isolate it. But I wonder if there are known general problem cases
that cause this kind of failure, or any known bugs in
WordDelimiterFilter (in Solr 4.3?) that cause this kind of failure.
And I wonder if WordDelimiterFilter emitting the token "MacBook"
with position "2" rather than "1" is expected, irrelevant, or possibly a
relevant problem.
Thanks again,
Jonathan
On 9/2/14 12:59 PM, Michael Della Bitta wrote:
Hi Jonathan,
Little confused by this line:
And, what I think it's trying to do, is match text indexed as "d elalain"
as well as text indexed as "delalain".
In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.
Also, is this a free-text field, or something that's more like a short
string?
Thanks,
Michael Della Bitta
On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
Hello, I'm running into a case where a query is not returning the results
I expect, and I'm hoping someone can offer some explanation that might help
me fine tune things or understand what's up.
I am running Solr 4.3.
My filter chain includes a WordDelimiterFilter and, later, a filter that
downcases everything for case-insensitive searching. It includes many other
things too, but I think these are the pertinent facts.
For query "dELALAIN", the WordDelimiterFilter splits into:
  text: d          start: 0   position: 1
  text: ELALAIN    start: 1   position: 2
  text: dELALAIN   start: 0   position: 2
Note the duplication/overlap of the tokens -- one version with "d" and
"ELALAIN" split into two tokens, and another with just one token.
Later, all the tokens are lowercased by another filter in the chain.
(It is actually an ICU filter that does something more complicated than just
lowercasing, but I think we can consider it lowercasing for the purposes of
this discussion.)
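For readers unfamiliar with the "position" column: in Lucene each token carries a position increment, and the position shown in analysis output is the running sum of those increments. A token with increment 0 is stacked at the same position as the token before it. The sketch below is a toy model in Python, not Solr code; the increments (1, 1, 0) are inferred from the positions 1, 2, 2 reported above rather than copied from Solr output, and plain lowercasing stands in for the ICU filter.

```python
from itertools import accumulate

# (text, start offset, position increment).
# Increments are an assumption inferred from the reported positions 1, 2, 2.
tokens = [("d", 0, 1), ("ELALAIN", 1, 1), ("dELALAIN", 0, 0)]


def with_positions(tokens):
    """Turn position increments into absolute positions and lowercase the text."""
    positions = accumulate(increment for _, _, increment in tokens)
    return [(text.lower(), start, position)
            for (text, start, _), position in zip(tokens, positions)]


for text, start, position in with_positions(tokens):
    print(f"text: {text:<10} start: {start}   position: {position}")
```

The stacked token ("delalain", increment 0) ends up sharing position 2 with "elalain", which is the overlap noted above.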
If I understand correctly what the WordDelimiterFilter is trying to do here,
it's probably treating the lowercase "d" followed by an uppercase letter as
a special case. (I don't get this behavior with other mixed-case queries
not beginning with 'd'.)
And, what I think it's trying to do, is match text indexed as "d elalain"
as well as text indexed as "delalain".
The problem is, it's not accomplishing that -- it is NOT matching text
that was indexed as "delalain" (one token).
I don't entirely understand what the "position" attribute is for -- but I
wonder if in this case, the position on "dELALAIN" is really supposed to be
1, not 2? Could that be responsible for the bug? Or is position
irrelevant in this case?
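One way to reason about the position question is a toy matching model in Python. It assumes (and this is only an assumption about how a query might be built from the token stream, not confirmed behavior of Solr 4.3's query parser for this field) that tokens sharing a position are alternatives, and that every position must be matched by some term in the document. The function names and the "hypothetical" token layout are made up for illustration.

```python
def group_by_position(tokens):
    """Map each position to the set of lowercased token texts found there."""
    groups = {}
    for text, position in tokens:
        groups.setdefault(position, set()).add(text.lower())
    return groups


def matches(query_tokens, indexed_terms):
    """Assumed rule: every query position needs at least one alternative in the document."""
    indexed = set(indexed_terms)
    return all(alternatives & indexed
               for alternatives in group_by_position(query_tokens).values())


# Token stream as reported in this message.
reported = [("d", 1), ("ELALAIN", 2), ("dELALAIN", 2)]
# Hypothetical layout with the unsplit token at position 1 instead.
hypothetical = [("d", 1), ("dELALAIN", 1), ("ELALAIN", 2)]

for name, stream in (("reported", reported), ("hypothetical", hypothetical)):
    print(name, matches(stream, ["d", "elalain"]), matches(stream, ["delalain"]))
```

Under these assumptions the reported stream matches a document containing "d" and "elalain" but not one containing only "delalain", because position 1 offers no alternative to "d". In this toy model, moving the unsplit token to position 1 does not help by itself either, since position 2 then has no alternative to "elalain".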
If that's not it, then I'm at a loss as to what may be causing this bug --
or even if it's a bug at all, or I'm just not understanding intended
behavior. I expect a query for "dELALAIN" to match text indexed as
"delalain" (because of the forced lowercasing in the filter chain). But
it's not doing so. Are my expectations wrong? Bug? Something else?
Thanks for any advice,
Jonathan