Re: WordDelimiter filter, expanding to multiple words, unexpected results

Erick Erickson Tue, 02 Sep 2014 10:52:25 -0700

bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
not "macbook"


I suspect your query-time parameters for WordDelimiterFilterFactory don't have
catenateWords set.
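
As a rough illustration of what catenation adds, here is a toy Python sketch.
It is not Lucene's implementation: the function name word_delimiter and its
catenate_words flag are made up for this example, and it only mimics
splitOnCaseChange=1 plus generateWordParts=1 followed by lowercasing,
ignoring token positions and every other option.

```python
import re


def word_delimiter(token, catenate_words=False):
    """Toy model of a case-change split followed by lowercasing.

    Hypothetical helper for illustration only; it does not reproduce
    Lucene's WordDelimiterFilter (no positions, no other options).
    """
    # Insert a space at every lower-to-upper transition, then split:
    # "MacBook" -> "Mac Book" -> ["Mac", "Book"]
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
    parts = [part.lower() for part in spaced.split()]

    tokens = list(parts)
    # With catenation on, also emit the parts joined back together,
    # so the single-word form is searchable too.
    if catenate_words and len(parts) > 1:
        tokens.append("".join(parts))
    return tokens


print(word_delimiter("MacBook"))
print(word_delimiter("MacBook", catenate_words=True))
```

In this simplified model, only the catenated run produces a "macbook" token;
without it, the query side has nothing but the separate parts to match on.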

What do you see when you enter these in both the index and query portions
of the admin/analysis page?

Best,
Erick


On Tue, Sep 2, 2014 at 10:34 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> If that's your problem, I bet all you have to do is twiddle on one of the
> catenate options, either catenateWords or catenateAll.
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> w: appinions.com <http://www.appinions.com/>
>
>
> On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind <rochk...@jhu.edu>
> wrote:
>
> > Thanks for the response.
> >
> > I understand the problem a little bit better after investigating more.
> >
> > Posting my full field definitions would, I think, be confusing, as they
> > are long and complicated. I can narrow it down to an isolated case if I
> > need to. The indexed field in question holds relatively short strings.
> >
> > But what it has to do with is the WordDelimiterFilter's default
> > splitOnCaseChange=1 and generateWordParts=1, and their effects.
> >
> > Let's take a less confusing example: query "MacBook", with a
> > WordDelimiterFilter followed by something that downcases everything.
> >
> > I think what the WDF (followed by case folding) is trying to do is make
> > query "MacBook" match both indexed text "mac book" and "macbook" --
> > either one should be a match. Is my understanding right of what
> > WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is
> > intended to do?
> >
> > In my actual index, query "MacBook" is matching ONLY "mac book", and not
> > "macbook", which is unexpected. I do want it to match both. (I realize
> > I could make it match only 'macbook' by setting splitOnCaseChange=0
> > and/or generateWordParts=0.)
> >
> > It's possible this is happening as a side effect of other parts of my
> > complex field definition, and I really do need to post the whole thing
> > and/or isolate it. But I wonder if there are known general problem cases
> > that cause this kind of failure, or any known bugs in WordDelimiterFilter
> > (in Solr 4.3?) that cause this kind of failure.
> >
> > And I wonder if WordDelimiter filter spitting out the token "MacBook" with
> > position "2" rather than "1" is expected, irrelevant, or possibly a
> > relevant problem.
> >
> > Thanks again,
> >
> > Jonathan
> >
> >
> > On 9/2/14 12:59 PM, Michael Della Bitta wrote:
> >
> >> Hi Jonathan,
> >>
> >> Little confused by this line:
> >>
> >>> And, what I think it's trying to do, is match text indexed as
> >>> "d elalain" as well as text indexed by "delalain".
> >>
> >> In this case, I don't know how WordDelimiterFilter will help, as you're
> >> likely tokenizing on spaces somewhere, and that input text has a space. I
> >> could be wrong. It's probably best if you post your field definition from
> >> your schema.
> >>
> >> Also, is this a free-text field, or something that's more like a short
> >> string?
> >>
> >> Thanks,
> >>
> >>
> >> Michael Della Bitta
> >>
> >>
> >>
> >> On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind <rochk...@jhu.edu>
> >> wrote:
> >>
> >>> Hello, I'm running into a case where a query is not returning the results
> >>> I expect, and I'm hoping someone can offer some explanation that might
> >>> help me fine-tune things or understand what's up.
> >>>
> >>> I am running Solr 4.3.
> >>>
> >>> My filter chain includes a WordDelimiterFilter and, later, a filter that
> >>> downcases everything for case-insensitive searching. It includes many
> >>> other things too, but I think these are the pertinent facts.
> >>>
> >>> For query "dELALAIN", the WordDelimiterFilter splits into:
> >>>
> >>> text: d
> >>> start: 0
> >>> position: 1
> >>>
> >>> text: ELALAIN
> >>> start: 1
> >>> position: 2
> >>>
> >>> text: dELALAIN
> >>> start: 0
> >>> position: 2
> >>>
> >>> Note the duplication/overlap of the tokens -- one version with "d" and
> >>> "ELALAIN" split into two tokens, and another with just one token.
> >>>
> >>> Later, all the tokens are lowercased by another filter in the chain
> >>> (actually an ICU filter which is doing something more complicated than
> >>> just lowercasing, but I think we can consider it lowercasing for the
> >>> purposes of this discussion).
> >>>
> >>> If I understand right what the WordDelimiterFilter is trying to do here,
> >>> it's probably doing something special because of the lowercase "d"
> >>> followed by an uppercase letter, a special case for that. (I don't get
> >>> this behavior with other mixed case queries not beginning with 'd'.)
> >>>
> >>> And, what I think it's trying to do, is match text indexed as
> >>> "d elalain" as well as text indexed by "delalain".
> >>>
> >>> The problem is, it's not accomplishing that -- it is NOT matching text
> >>> that was indexed as "delalain" (one token).
> >>>
> >>> I don't entirely understand what the "position" attribute is for -- but I
> >>> wonder if in this case, the position on "dELALAIN" is really supposed to
> >>> be 1, not 2?  Could that be responsible for the bug?  Or is position
> >>> irrelevant in this case?
> >>>
> >>> If that's not it, then I'm at a loss as to what may be causing this bug
> >>> -- or even if it's a bug at all, or I'm just not understanding intended
> >>> behavior. I expect a query for "dELALAIN" to match text indexed as
> >>> "delalain" (because of the forced lowercasing in the filter chain). But
> >>> it's not doing so. Are my expectations wrong? Bug? Something else?
> >>>
> >>> Thanks for any advice,
> >>>
> >>> Jonathan
> >>>
> >>>
> >>
>
