Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jack Krupansky Tue, 30 Dec 2014 09:03:23 -0800

I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html


You're not "wrong" about anything here... you just need to accept that WDF
is not magic and can't handle every use case that anybody can imagine.

And you do need to be careful about interactions between the query parser
and the analyzers, especially in these kinds of cases where a single term
might generate multiple terms.
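
As a rough sketch only (not a recommended configuration), separate index and
query analyzers for the settings discussed in this thread might look like the
following. The field type name "text_wdf", the whitespace tokenizer, and the
lowercase filter are assumptions for illustration; the original poster uses an
ICUFoldingFilter for case folding instead.

```xml
<!-- Hypothetical field type: the name, tokenizer, and lowercase filter are assumptions. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: split on case change and also emit the catenated word. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: do not split on case change, so "mixedCase" stays one term. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a setup along these lines, a query for "mixedCase" should match an
indexed "mixedCase" through the catenated token, but it would not match the
two separate indexed words "mixed Case". That is the trade-off described
further down in the thread.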

Some of these features really are only suitable for advanced, "expert"
users.

Note that one feature Solr is missing is support for the Google-like
splitting of concatenated words (regardless of case).
That's worthy of a Jira.


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:44 AM, Jonathan Rochkind <rochk...@jhu.edu>
wrote:

> I guess I don't understand what the four use cases are, or the three out
> of four use cases, or whatever. What the intended uses of the WDF are.
>
> Can you explain what the intended use of setting:
>
> generateWordParts="1" catenateWords="1" splitOnCaseChange="1"
>
> Is that supposed to do something useful (at either query or index time),
> or is that a nonsensical configuration that nobody should ever use?
>
> I understand how analysis can be different at index vs query time. I think
> what I don't fully understand is what the possibilities and intended use
> case of the WDF are, with various configurations.
>
> I thought one of the intended use cases, with appropriate configuration,
> was to do what I'm talking about: allow a "mixedCase" query to match both
> "mixedCase" and "mixed Case" in the index. I think you're saying I'm wrong,
> and this is not something WDF can do? Can you confirm I understand you right?
>
> Thanks!
>
> Jonathan
>
>
> On 12/30/14 11:30 AM, Jack Krupansky wrote:
>
>> Right, that's what I meant by WDF not being "magic" - you can configure it
>> to match any three out of four use cases as you choose, but there is no
>> choice that matches all of the use cases.
>>
>> To be clear, this is not a "bug" in WDF, but simply a limitation.
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind <rochk...@jhu.edu>
>> wrote:
>>
>>> Thanks Erick!
>>>
>>> Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
>>> query for "mixedCase" will no longer also match "mixed Case".
>>>
>>> I think I want WDF to... kind of do all of the above.
>>>
>>> Specifically, I had thought that it would allow a query for "mixedCase"
>>> to match both/either "mixed Case" or "mixedCase" in the index (with case
>>> insensitivity on top of that via another filter).
>>>
>>> That would support things like names like "duBois" which are sometimes
>>> spelled "du bois" and sometimes "dubois", and allow the query "duBois" to
>>> match both in the index.
>>>
>>> I had somehow thought that was what WDF was intended for. But it's
>>> actually not the usual functioning, and may not be realistic?
>>>
>>> I'm a bit confused about what splitOnCaseChange combined with
>>> catenateWords is meant to do at all. It _is_ generating both the split
>>> and single-word tokens at query time -- but not in a way that actually
>>> allows it to match both the split and single-word tokens? What is
>>> supposed to be the purpose/use case for splitOnCaseChange with
>>> catenateWords? If any?
>>>
>>> Jonathan
>>>
>>>
>>> On 12/29/14 7:20 PM, Erick Erickson wrote:
>>>
>>>> Jonathan:
>>>>
>>>> Well, it works if you set splitOnCaseChange="0" in just the query part
>>>> of the analysis chain. I probably misled you a bit months ago: WDFF
>>>> is intended for this case iff you expect the case change to generate
>>>> _tokens_ that are individually meaningful. And unfortunately, what is
>>>> "significant" in one case will be not-significant in others.
>>>>
>>>> So what kinds of things do you want WDFF to handle? Case changes?
>>>> Letter/non-letter transitions? All of the above?
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>
>>>>
>>>> On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind <rochk...@jhu.edu>
>>>> wrote:
>>>>
>>>>> On 12/29/14 5:24 PM, Jack Krupansky wrote:
>>>>>
>>>>>
>>>>>> WDF is powerful, but it is not magic. In general, the indexed data is
>>>>>> expected to be clean while the query might be sloppy. You need to
>>>>>> separate the index and query analyzers, and they need to respect that
>>>>>> distinction.
>>>>>>
>>>>>>
>>>>>
>>>>> I do not understand what separate query/index analysis you are
>>>>> suggesting to accomplish what I wanted.
>>>>>
>>>>> I understand the WDF, like all software, is not magic, of course. But I
>>>>> thought this was an intended use case of the WDF, with those settings:
>>>>>
>>>>> A "mixedCase" query would match "mixedCase" in the index; and the same
>>>>> query "mixedCase" would also match two separate words "mixed Case" in
>>>>> the index. (Case insensitively, since I apply an ICUFoldingFilter on
>>>>> top of that.)
>>>>>
>>>>> Was I wrong, is this not an intended thing for the WDF to do? Or do I
>>>>> just have the wrong configuration options for it to do it? Or is it a
>>>>> bug?
>>>>>
>>>>> When I started this thread a few months ago, I think Erick Erickson
>>>>> agreed this was an intended use case for the WDF, but maybe I explained
>>>>> it poorly. Erick, if you're around and want to at least confirm whether
>>>>> WDF is supposed to do this in your understanding, that would be great!
>>>>>
>>>>> Jonathan
>>>>>
>>>>>
>>>>
>>
