You want preserveOriginal="1". You should only do this processing at index time.
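
Roughly something like this in schema.xml. This is an untested sketch, not a drop-in config: the field type name "text_wdf" and the whitespace tokenizer are assumptions, and the other WDF flags are just the ones mentioned earlier in this thread. Adjust it to your real analysis chain.

```xml
<!-- Hypothetical field type: the name "text_wdf" and the tokenizer choice are assumptions. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on case change, catenate the parts, and also keep the unsplit token. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            splitOnCaseChange="1"
            preserveOriginal="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- No WordDelimiterFilter here: the WDF processing happens at index time only. -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With this, an indexed "mixedCase" should produce the split parts plus the original token, so a query-side "mixedCase" (folded but not split) can match it. As far as I know, WDF only looks inside a single token, so by itself it will not join two whitespace-separated words in the index.
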
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Dec 30, 2014, at 9:33 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Okay, thanks. I'm not sure if it's my lack of understanding, but I feel like
> I'm having a very hard time getting straight answers out of you all, here.
>
> I want the query "mixedCase" to match both/either "mixed Case" and
> "mixedCase" in the index.
>
> What configuration of WDF at index/query time would do this?
>
> This isn't necessarily the only thing I want WDF to do, but it's something I
> want it to do and thought it was doing and found out it wasn't. So we can
> isolate/simplify to there -- if I can figure out what WDF configuration (if
> any?) can do that first, then I can always move on to figuring out how/if
> that impacts the other things I want WDF to do.
>
> So is there a WDF configuration that can do that? Or is the problem that it's
> confusing, and none of you all are sure either if there is what it would be,
> it's not clear?
>
> Jonathan
>
> On 12/30/14 12:02 PM, Jack Krupansky wrote:
>> I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
>> http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
>>
>> You're not "wrong" about anything here... you just need to accept that WDF
>> is not magic and can't handle every use case that anybody can imagine.
>>
>> And you do need to be careful about interactions between the query parser
>> and the analyzers, especially in these kinds of cases where a single term
>> might generate multiple terms.
>>
>> Some of these features really are only suitable for advanced, "expert"
>> users.
>>
>> Note that one of the features that Solr is missing is support for the
>> Google-like feature of splitting concatenated words (regardless of case.)
>> That's worthy of a Jira.
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Dec 30, 2014 at 11:44 AM, Jonathan Rochkind <rochk...@jhu.edu>
>> wrote:
>>
>>> I guess I don't understand what the four use cases are, or the three out
>>> of four use cases, or whatever. What the intended uses of the WDF are.
>>>
>>> Can you explain what the intended use is of setting:
>>>
>>> generateWordParts="1" catenateWords="1" splitOnCaseChange="1"
>>>
>>> Is that supposed to do something useful (at either query or index time),
>>> or is that a nonsensical configuration that nobody should ever use?
>>>
>>> I understand how analysis can be different at index vs query time. I think
>>> what I don't fully understand is what the possibilities and intended use
>>> case of the WDF are, with various configurations.
>>>
>>> I thought one of the intended use cases, with appropriate configuration,
>>> was to do what I'm talking about: allow "mixedCase" query to match both
>>> "mixed Case" and "mixedCase" in the index. I think you're saying I'm wrong,
>>> and this is not something WDF can do? Can you confirm I understand you right?
>>>
>>> Thanks!
>>>
>>> Jonathan
>>>
>>>
>>> On 12/30/14 11:30 AM, Jack Krupansky wrote:
>>>
>>>> Right, that's what I meant by WDF not being "magic" - you can configure it
>>>> to match any three out of four use cases as you choose, but there is no
>>>> choice that matches all of the use cases.
>>>>
>>>> To be clear, this is not a "bug" in WDF, but simply a limitation.
>>>>
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind <rochk...@jhu.edu>
>>>> wrote:
>>>>
>>>>> Thanks Erick!
>>>>>
>>>>> Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
>>>>> query for "mixedCase" will no longer also match "mixed Case".
>>>>>
>>>>> I think I want WDF to... kind of do all of the above.
>>>>>
>>>>> Specifically, I had thought that it would allow a query for "mixedCase" to
>>>>> match both/either "mixed Case" or "mixedCase" in the index. (with case
>>>>> insensitivity on top of that via another filter).
>>>>>
>>>>> That would support things like names like "duBois" which are sometimes
>>>>> spelled "du bois" and sometimes "dubois", and allow the query "duBois" to
>>>>> match both in the index.
>>>>>
>>>>> I had somehow thought that was what WDF was intended for. But it's
>>>>> actually not the usual functioning, and may not be realistic?
>>>>>
>>>>> I'm a bit confused about what splitOnCaseChange combined with
>>>>> catenateWords is meant to do at all. It _is_ generating both the split and
>>>>> single-word tokens at query time -- but not in a way that actually allows
>>>>> it to match both the split and single-word tokens? What is supposed to be
>>>>> the purpose/use case for splitOnCaseChange with catenateWords? If any?
>>>>>
>>>>> Jonathan
>>>>>
>>>>>
>>>>> On 12/29/14 7:20 PM, Erick Erickson wrote:
>>>>>
>>>>>> Jonathan:
>>>>>>
>>>>>> Well, it works if you set splitOnCaseChange="0" in just the query part
>>>>>> of the analysis chain. I probably misled you a bit months ago. WDFF
>>>>>> is intended for this case iff you expect the case change to generate
>>>>>> _tokens_ that are individually meaningful. And unfortunately
>>>>>> "significant" in one case will be not-significant in others.
>>>>>>
>>>>>> So what kinds of things do you want WDFF to handle? Case changes?
>>>>>> Letter/non-letter transitions? All of the above?
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind <rochk...@jhu.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> On 12/29/14 5:24 PM, Jack Krupansky wrote:
>>>>>>>
>>>>>>>
>>>>>>>> WDF is powerful, but it is not magic. In general, the indexed data is
>>>>>>>> expected to be clean while the query might be sloppy. You need to
>>>>>>>> separate the index and query analyzers and they need to respect that
>>>>>>>> distinction.
>>>>>>>>
>>>>>>>
>>>>>>> I do not understand what separate query/index analysis you are
>>>>>>> suggesting to accomplish what I wanted.
>>>>>>>
>>>>>>> I understand the WDF, like all software, is not magic, of course. But I
>>>>>>> thought this was an intended use case of the WDF, with those settings:
>>>>>>>
>>>>>>> A "mixedCase" query would match "mixedCase" in the index; and the same
>>>>>>> query "mixedCase" would also match two separate words "mixed Case" in
>>>>>>> index. (Case insensitively since I apply an ICUFoldingFilter on top of
>>>>>>> that).
>>>>>>>
>>>>>>> Was I wrong, is this not an intended thing for the WDF to do? Or do I
>>>>>>> just have the wrong configuration options for it to do it? Or is it a
>>>>>>> bug?
>>>>>>>
>>>>>>> When I started this thread a few months ago, I think Erick Erickson
>>>>>>> agreed this was an intended use case for the WDF, but maybe I explained
>>>>>>> it poorly. Erick if you're around and want to at least confirm whether
>>>>>>> WDF is supposed to do this in your understanding, that would be great!
>>>>>>>
>>>>>>> Jonathan