bq: In my actual index, query "MacBook" is matching ONLY "mac book", and not "macbook"
I suspect your query parameters for WordDelimiterFilterFactory don't have catenateWords set. What do you see when you enter these in both the index and query portions of the admin/analysis page?

Best,
Erick

On Tue, Sep 2, 2014 at 10:34 AM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:

> If that's your problem, I bet all you have to do is twiddle on one of the
> catenate options, either catenateWords or catenateAll.
>
> Michael Della Bitta
>
> Applications Developer
> o: +1 646 532 3062
> appinions inc.
> “The Science of Influence Marketing”
> 18 East 41st Street
> New York, NY 10017
> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
> w: appinions.com <http://www.appinions.com/>
>
> On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>
> > Thanks for the response.
> >
> > I understand the problem a little bit better after investigating more.
> >
> > Posting my full field definitions is, I think, going to be confusing, as
> > they are long and complicated. I can narrow it down to an isolation case if
> > I need to. My indexed field in question is relatively short strings.
> >
> > But what it's got to do with is the WordDelimiterFilter's default
> > splitOnCaseChange=1 and generateWordParts=1, and the effects of such.
> >
> > Let's take a less confusing example, query "MacBook". With a
> > WordDelimiterFilter followed by something that downcases everything.
> >
> > I think what the WDF (followed by case folding) is trying to do is make
> > query "MacBook" match both indexed text "mac book" as well as "macbook" --
> > either one should be a match. Is my understanding right of what
> > WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is
> > intending to do?
> >
> > In my actual index, query "MacBook" is matching ONLY "mac book", and not
> > "macbook".
> > Which is unexpected. I indeed want it to match both. (I realize
> > I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or
> > generateWordParts=0).
> >
> > It's possible this is happening as a side effect of other parts of my
> > complex field definition, and I really do need to post the whole thing
> > and/or isolate it. But I wonder if there are known general problem cases
> > that cause this kind of failure, or any known bugs in WordDelimiterFilter
> > (in Solr 4.3?) that cause this kind of failure.
> >
> > And I wonder if WordDelimiterFilter spitting out the token "MacBook" with
> > position "2" rather than "1" is expected, irrelevant, or possibly a
> > relevant problem.
> >
> > Thanks again,
> >
> > Jonathan
> >
> > On 9/2/14 12:59 PM, Michael Della Bitta wrote:
> >
> >> Hi Jonathan,
> >>
> >> Little confused by this line:
> >>
> >>> And, what I think it's trying to do, is match text indexed as "d elalain"
> >>> as well as text indexed by "delalain".
> >>
> >> In this case, I don't know how WordDelimiterFilter will help, as you're
> >> likely tokenizing on spaces somewhere, and that input text has a space. I
> >> could be wrong. It's probably best if you post your field definition from
> >> your schema.
> >>
> >> Also, is this a free-text field, or something that's more like a short
> >> string?
> >>
> >> Thanks,
> >>
> >> Michael Della Bitta
> >>
> >> On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> >>
> >>> Hello, I'm running into a case where a query is not returning the results
> >>> I expect, and I'm hoping someone can offer some explanation that might help
> >>> me fine tune things or understand what's up.
> >>>
> >>> I am running Solr 4.3.
> >>>
> >>> My filter chain includes a WordDelimiterFilter and, later a filter that
> >>> downcases everything for case-insensitive searching. It includes many other
> >>> things too, but I think these are the pertinent facts.
> >>>
> >>> For query "dELALAIN", the WordDelimiterFilter splits into:
> >>>
> >>> text: d
> >>> start: 0
> >>> position: 1
> >>>
> >>> text: ELALAIN
> >>> start: 1
> >>> position: 2
> >>>
> >>> text: dELALAIN
> >>> start: 0
> >>> position: 2
> >>>
> >>> Note the duplication/overlap of the tokens -- one version with "d" and
> >>> "ELALAIN" split into two tokens, and another with just one token.
> >>>
> >>> Later, all the tokens are lowercased by another filter in the chain.
> >>> (actually an ICU filter which is doing something more complicated than just
> >>> lowercasing, but I think we can consider it lowercasing for the purposes of
> >>> this discussion).
> >>>
> >>> If I understand right what the WordDelimiterFilter is trying to do here,
> >>> it's probably doing something special because of the lowercase "d" followed
> >>> by an uppercase letter, a special case for that. (I don't get this behavior
> >>> with other mixed case queries not beginning with 'd').
> >>>
> >>> And, what I think it's trying to do, is match text indexed as "d elalain"
> >>> as well as text indexed by "delalain".
> >>>
> >>> The problem is, it's not accomplishing that -- it is NOT matching text
> >>> that was indexed as "delalain" (one token).
> >>>
> >>> I don't entirely understand what the "position" attribute is for -- but I
> >>> wonder if in this case, the position on "dELALAIN" is really supposed to be
> >>> 1, not 2? Could that be responsible for the bug? Or is position
> >>> irrelevant in this case?
> >>>
> >>> If that's not it, then I'm at a loss as to what may be causing this bug --
> >>> or even if it's a bug at all, or I'm just not understanding intended
> >>> behavior. I expect a query for "dELALAIN" to match text indexed as
> >>> "delalain" (because of the forced lowercasing in the filter chain). But
> >>> it's not doing so. Are my expectations wrong? Bug? Something else?
> >>>
> >>> Thanks for any advice,
> >>>
> >>> Jonathan
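
For reference, the catenate options suggested in the replies in this thread might be set roughly like this in schema.xml. This is only a sketch, not the poster's actual field definition: the field type name text_wdf, the whitespace tokenizer, and the plain LowerCaseFilterFactory (standing in for the ICU filter in the original chain) are all assumptions made for illustration. Whether this resolves the matching problem described in the thread is not established here.

```xml
<!-- Hypothetical field type: the name "text_wdf" and the tokenizer are assumptions. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on case change and also emit the joined form of the parts. -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            generateWordParts="1"
            catenateWords="1"/>
    <!-- Assumed stand-in for the ICU filter mentioned in the thread. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Same catenate setting on the query side, as suggested in the replies. -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            generateWordParts="1"
            catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Changes to the index-side analyzer only take effect for documents indexed afterwards, so a reindex is needed before comparing. The admin/analysis page can then show the index-side and query-side tokens for "MacBook" and "macbook" side by side.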