Re: WordDelimiter filter, expanding to multiple words, unexpected results

Walter Underwood Tue, 30 Dec 2014 09:37:03 -0800

You want preserveOriginal="1".

You should only do this processing at index time.
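
For concreteness, that might look roughly like the following in schema.xml. This is only a sketch: the field type name and the whitespace tokenizer are assumptions for illustration, the WDF attributes are the ones quoted below plus preserveOriginal, and the folding filter is the ICUFoldingFilter Jonathan mentions. It has not been verified here that this covers the case where the index holds the two separate words "mixed Case", so check it in the Analysis screen before relying on it.

```xml
<!-- Sketch only: field type name and tokenizer are assumptions, not from this thread. -->
<fieldType name="text_wdf_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDF runs only at index time and keeps the original token
         alongside the split and catenated forms. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            splitOnCaseChange="1"
            preserveOriginal="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- No WordDelimiterFilterFactory at query time. -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```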


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Dec 30, 2014, at 9:33 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Okay, thanks. I'm not sure if it's my lack of understanding, but I feel like 
> I'm having a very hard time getting straight answers out of you all, here.
> 
> I want the query "mixedCase" to match both/either "mixed Case" and 
> "mixedCase" in the index.
> 
> What configuration of WDF at index/query time would do this?
> 
> This isn't necessarily the only thing I want WDF to do, but it's something I 
> want it to do and thought it was doing and found out it wasn't. So we can 
> isolate/simplify to there -- if I can figure out what WDF configuration (if 
> any?) can do that first, then I can always move on to figuring out how/if 
> that impacts the other things I want WDF to do.
> 
> So is there a WDF configuration that can do that? Or is the problem that it's 
> confusing, and none of you are sure whether such a configuration exists or 
> what it would be?
> 
> Jonathan
> 
> On 12/30/14 12:02 PM, Jack Krupansky wrote:
>> I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
>> http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
>> 
>> You're not "wrong" about anything here... you just need to accept that WDF
>> is not magic and can't handle every use case that anybody can imagine.
>> 
>> And you do need to be careful about interactions between the query parser
>> and the analyzers, especially in these kinds of cases where a single term
>> might generate multiple terms.
>> 
>> Some of these features really are only suitable for advanced, "expert"
>> users.
>> 
>> Note that one of the features that Solr is missing is support for the
>> Google-like feature of splitting concatenated words (regardless of case.)
>> That's worthy of a Jira.
>> 
>> 
>> -- Jack Krupansky
>> 
>> On Tue, Dec 30, 2014 at 11:44 AM, Jonathan Rochkind <rochk...@jhu.edu>
>> wrote:
>> 
>>> I guess I don't understand what the four use cases are, or the three out
>>> of four use cases, or whatever. What the intended uses of the WDF are.
>>> 
>>> Can you explain what the intended use is of setting:
>>> 
>>> generateWordParts="1" catenateWords="1" splitOnCaseChange="1"
>>> 
>>> Is that supposed to do something useful (at either query or index time),
>>> or is that a nonsensical configuration that nobody should ever use?
>>> 
>>> I understand how analysis can be different at index vs query time. I think
>>> what I don't fully understand is what the possibilities and intended use
>>> case of the WDF are, with various configurations.
>>> 
>>> I thought one of the intended use cases, with appropriate configuration,
>>> was to do what I'm talking about: allow "mixedCase" query to match both
>>> "mixedCase" and "mixed Case" in the index. I think you're saying I'm wrong, and
>>> this is not something WDF can do? Can you confirm I understand you right?
>>> 
>>> Thanks!
>>> 
>>> Jonathan
>>> 
>>> 
>>> On 12/30/14 11:30 AM, Jack Krupansky wrote:
>>> 
>>>> Right, that's what I meant by WDF not being "magic" - you can configure it
>>>> to match any three out of four use cases as you choose, but there is no
>>>> choice that matches all of the use cases.
>>>> 
>>>> To be clear, this is not a "bug" in WDF, but simply a limitation.
>>>> 
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind <rochk...@jhu.edu>
>>>> wrote:
>>>> 
>>>>> Thanks Erick!
>>>>> 
>>>>> Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
>>>>> query for "mixedCase" will no longer also match "mixed Case".
>>>>> 
>>>>> I think I want WDF to... kind of do all of the above.
>>>>> 
>>>>> Specifically, I had thought that it would allow a query for "mixedCase"
>>>>> to match both/either "mixed Case" or "mixedCase" in the index. (with
>>>>> case insensitivity on top of that via another filter).
>>>>> 
>>>>> That would support things like names like "duBois" which are sometimes
>>>>> spelled "du bois" and sometimes "dubois", and allow the query "duBois" to
>>>>> match both in the index.
>>>>> 
>>>>> I had somehow thought that was what WDF was intended for. But it's
>>>>> actually not the usual functioning, and may not be realistic?
>>>>> 
>>>>> I'm a bit confused about what splitOnCaseChange combined with
>>>>> catenateWords is meant to do at all.  It _is_ generating both the split
>>>>> and single-word tokens at query time -- but not in a way that actually
>>>>> allows it to match both the split and single-word tokens?  What is
>>>>> supposed to be the purpose/use case for splitOnCaseChange with
>>>>> catenateWords? If any?
>>>>> 
>>>>> Jonathan
>>>>> 
>>>>> 
>>>>> On 12/29/14 7:20 PM, Erick Erickson wrote:
>>>>> 
>>>>>> Jonathan:
>>>>>> 
>>>>>> Well, it works if you set splitOnCaseChange="0" in just the query part
>>>>>> of the analysis chain. I probably misled you a bit months ago; WDFF
>>>>>> is intended for this case iff you expect the case change to generate
>>>>>> _tokens_ that are individually meaningful. And unfortunately
>>>>>> "significant" in one case will be not-significant in others.
>>>>>> 
>>>>>> So what kinds of things do you want WDFF to handle? Case changes?
>>>>>> Letter/non-letter transitions? All of the above?
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind <rochk...@jhu.edu>
>>>>>> wrote:
>>>>>> 
>>>>>>> On 12/29/14 5:24 PM, Jack Krupansky wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> WDF is powerful, but it is not magic. In general, the indexed data is
>>>>>>>> expected to be clean while the query might be sloppy. You need to
>>>>>>>> separate the index and query analyzers and they need to respect that
>>>>>>>> distinction
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> I do not understand what separate query/index analysis you are
>>>>>>> suggesting to accomplish what I wanted.
>>>>>>> 
>>>>>>> I understand the WDF, like all software, is not magic, of course. But I
>>>>>>> thought this was an intended use case of the WDF, with those settings:
>>>>>>> 
>>>>>>> A "mixedCase" query would match "mixedCase" in the index; and the same
>>>>>>> query "mixedCase" would also match two separate words "mixed Case" in
>>>>>>> index. (Case insensitively since I apply an ICUFoldingFilter on top of
>>>>>>> that).
>>>>>>> 
>>>>>>> Was I wrong, is this not an intended thing for the WDF to do? Or do I
>>>>>>> just have the wrong configuration options for it to do it? Or is it a
>>>>>>> bug?
>>>>>>> 
>>>>>>> When I started this thread a few months ago, I think Erick Erickson
>>>>>>> agreed this was an intended use case for the WDF, but maybe I explained
>>>>>>> it poorly. Erick if you're around and want to at least confirm whether
>>>>>>> WDF is supposed to do this in your understanding, that would be great!
>>>>>>> 
>>>>>>> Jonathan
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>
