Thanks. I'm trying to think through whether there's any hypothetical way for dismax to be improved so it's not subject to this problem. Now that it's clear that the problem isn't just with stopwords, and that in fact it's very hard to predict whether you'll hit it, and under what input, when creating your schema and 'qf' list... it seems a worse problem than it did when it was thought of as just stopwords-related.

Of course, I'm trying to think through this without actually understanding the dismax code at all, just based on what I know of how dismax works from black box observation.

It seems like the problem arises when different fields in the 'qf' produce a different number of tokens for a given query. dismax needs to know the number of tokens in the input in order to calculate 'mm' when 'mm' is expressed as a percentage, or when different mm values are given for different numbers of input tokens.

Somehow dismax gets at this number now based on the actual field analysis, not just whitespace-splitting at the query parser level. Because if I issue the query "roosevelt & churchill", and ALL the fields involved have analysis that turns this into just two tokens ['roosevelt', 'churchill'], then dismax does the right thing, recognizing two terms in the input. The problem is when some of the fields produce two tokens from that input, and others produce three -- dismax, I think, then decides there are three terms in the input, but in at least some fields those 'three' terms can't possibly all match. (With mm=100%, all three clauses are then required, and the clause built from the bare '&' can only ever match in a field that kept the punctuation, so records get thrown out even though they match both real terms everywhere else.)

So what if dismax could recognize that different fields were producing a different arity from the input, and use the _smallest_ number for its 'mm' calculations, instead of the current behavior where it's effectively the largest number? (Or '1' if the smallest number is '0'?!) That would in some cases produce errors in the other direction -- more hits coming back than you naively/intuitively expect. Not sure if that would be worse or better. Seems better to me, a less bad failure mode.

Or, better yet but surely harder, perhaps infeasible to code: it could somehow apply the 'mm' differently to each field. I'm not even sure what that means exactly. Somehow an mm of 100% would mean two terms must match in the field whose analysis produces 2 tokens, OR three terms in the field whose analysis produces 3... man, that's a mess. Okay, stick with the first idea.
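Just to pin down what I mean by that first idea, here's a rough standalone sketch. The field names and the hard-coded "analysis results" are made up for illustration, and this is nothing like the real dismax internals -- it just shows basing the mm math on the smallest per-field token count:

import java.util.*;

public class MinArityMmSketch {
    public static void main(String[] args) {
        // Pretend per-field analysis output for the query "roosevelt & churchill".
        // (Hypothetical field names; real dismax would get these from the analyzers.)
        Map<String, List<String>> tokensPerField = new LinkedHashMap<String, List<String>>();
        tokensPerField.put("stripping_field", Arrays.asList("roosevelt", "churchill"));
        tokensPerField.put("keeping_field", Arrays.asList("roosevelt", "&", "churchill"));

        // Use the *smallest* non-zero token count across the qf fields...
        int smallest = Integer.MAX_VALUE;
        for (List<String> tokens : tokensPerField.values()) {
            if (!tokens.isEmpty()) {
                smallest = Math.min(smallest, tokens.size());
            }
        }
        // ...falling back to 1 if every field analyzed the input away entirely.
        if (smallest == Integer.MAX_VALUE) {
            smallest = 1;
        }

        // mm=100% applied to the smallest count: 2 clauses required, not 3.
        double mmFraction = 1.00;
        int required = (int) Math.round(smallest * mmFraction);
        System.out.println("required clauses: " + required);
    }
}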

But I've got no idea how feasible that is to code, I personally have no time to figure out how to code it, and nobody else is likely to, since this problem is unlikely to be a high priority for Solr committers... so, I dunno.

On 6/15/2011 3:46 PM, Erick Erickson wrote:
Jonathan:

Thanks for writing that up, you're right, it is arcane....

I've starred this one!

Erick

http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

So to understand, first familiarize yourself with that.

However, none of the fields involved here had any stopwords at all, so at
first it wasn't obvious this was the problem. But having different
tokenization and other analysis between fields can result in exactly the
same problem, for certain queries.

One field in the dismax qf used an analyzer that stripped punctuation. (I'm actually not positive at this point _which_ analyzer in my chain was stripping punctuation -- I'm using a bunch, including some custom ones -- but I was aware that punctuation was being stripped; it was intentional.)

So "monkey's" turns into "monkey".  "monkey:" turns into "monkey".  So far
so good. But what happens if you have punctuation all by itself seperated by
whitespace?  "Roosevlet&  Churchill" turns into ['roosevelt', 'churchill'].
  That ampersand in the middle was stripped out, essentially _just as if_ it
were a stopword. Only two tokens result from that input.

You can see where this is going -- another field involved in the dismax qf
did NOT strip out punctuation. So three tokens result from that input,
['Roosevelt', '&', 'Churchill'].
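
If you want to see just the token-count difference outside of Solr, here's a toy approximation of the two chains -- plain string hacking, not the real analyzers, but it shows where the 2-vs-3 mismatch comes from:

import java.util.*;

public class TokenCountDemo {
    public static void main(String[] args) {
        String query = "Roosevelt & Churchill";

        // Roughly like the stripping field: whitespace split, lowercase,
        // strip punctuation, drop any tokens that end up empty.
        List<String> stripping = new ArrayList<String>();
        for (String chunk : query.toLowerCase().split("\\s+")) {
            String t = chunk.replaceAll("\\p{Punct}", "");
            if (t.length() > 0) {
                stripping.add(t);
            }
        }

        // Roughly like the non-stripping field: whitespace split only.
        List<String> keeping = Arrays.asList(query.split("\\s+"));

        System.out.println(stripping); // [roosevelt, churchill]    -> 2 tokens
        System.out.println(keeping);   // [Roosevelt, &, Churchill] -> 3 tokens
    }
}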

Now we have exactly the situation that gives rise to the dismax stopwords
mm-behaving-funny situation; it's exactly the same thing.

Now I've fixed this for punctuation just by making those fields strip out
punctuation too, by adding these filters to the bottom of the analyzer chain in
those previously-not-stripping-punctuation field definitions:

<!-- strip punctuation, to avoid the dismax stopwords-like mm bug -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="([\p{Punct}])" replacement="" replace="all"/>
<!-- if after stripping punctuation we have any 0-length tokens, make
     sure to eliminate them. We can use LengthFilter min=1 for that;
     we don't care about the max here, just a very large number. -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/>


And things are working how I expect again, at least for this punctuation
issue. But there may be other edge cases where differences in analysis
result in a different number of tokens from different fields, which, if they
are both included in a dismax qf, will have bad effects on 'mm'.

The lesson, I think, is that the only absolutely safe way to use dismax 'mm'
is when all fields in the 'qf' have exactly the same analysis. But
obviously that's not very practical; it destroys much of the power of
dismax. And some differences in analysis are certainly acceptable -- but
it's rather tricky to figure out whether your differences in analysis are going
to be significant for this problem, and under what input, and if so to fix them.
It is not an easy thing to do. So dismax definitely has this gotcha
potentially waiting for you whenever you mix fields with different analysis
in a 'qf'.


On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:
Okay, let's try the debug trace again without a pf to be less confusing.

One field in the qf, ordinarily tokenized text, and it does get hits:


q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01)
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>

And that gets 25 hits. Now we add a second field to the qf; this second
field is also ordinarily tokenized. We expect no _fewer_ than 25 hits when
adding another field into the qf, right? And indeed it still results in exactly
25 hits (no additional hits from the additional qf field).


?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01)
DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt |
title1_t:roosevelt)~0.01)~2) ()
</str>



Okay, now we go back to just that first (ordinarily tokenized) field, but
add a second field that uses KeywordTokenizerFactory. We don't necessarily
expect this field to ever match a multi-word query, but we don't expect
fewer than 25 hits; the 25 hits from the first field in the qf should
still be there, right? But they're not. What happened, why not?


q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=


str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill |
title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01)
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill |
title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt |
title1_t:roosevelt)~0.01)~3) ()</str>



On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
I'm aware that using a field tokenized with KeywordTokenizerFactory in
a dismax 'qf' is often going to result in 0 hits on that field (when a
whitespace-containing query is entered). But I do it anyway, for the cases
where a non-whitespace-containing query is entered; then it hits. And in
the cases where it doesn't hit, I figure okay, well, the other fields in the
qf will hit or not, that's good enough.

And usually that works. But it works _differently_ when my query contains
an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't,
and I can't figure out why.

basically,

&defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits. The ":" is thrown out by the text_field's analysis, but the mm still
passes somehow, right?

But, in the same index:

&defType=dismax&mm=100%&q=one : two&qf=text_field keyword_tokenized_text_field

gets 0 hits. Somehow maybe the inclusion of the keyword_tokenized_text_field
in the qf causes dismax to calculate the mm differently: it decides there are
three tokens in there and they all must match, and the token ":" can never
match because it's not in my index, it's stripped out... but somehow this
isn't a problem unless I include a keyword-tokenized field in the qf?

This is really confusing; if anyone has any idea what I'm talking about
and can shed any light on it, much appreciated.

The conclusion I am reaching is to just NEVER include anything but a more or
less ordinarily tokenized field in a dismax qf. Sadly, that was useful for
certain of my use cases.

Oh, hey, the debugging trace would probably be useful:


<lstname="debug">
<strname="rawquerystring">
churchill : roosevelt
</str>
<strname="querystring">
churchill : roosevelt
</str>
<strname="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt |
title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill
roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
roosevelt"~3^80.0)~0.01)
</str>
<strname="parsedquery_toString">
+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
(isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) (title2_unstem:"churchill
roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
roosevelt"~3^80.0)~0.01
</str>
<lstname="explain"/>
<strname="QParser">
DisMaxQParser
</str>
<nullname="altquerystring"/>
<nullname="boostfuncs"/>
<lstname="timing">
<doublename="time">
6.0
</double>
<lstname="prepare">
<doublename="time">
3.0
</double>
<lstname="org.apache.solr.handler.component.QueryComponent">
<doublename="time">
2.0
</double>
</lst>
<lstname="org.apache.solr.handler.component.FacetComponent">
<doublename="time">
0.0
</double>
</lst>
<lstname="org.apache.solr.handler.component.MoreLikeThisComponent">
<doublename="time">
0.0
</double>
</lst>
<lstname="org.apache.solr.handler.component.HighlightComponent">
<doublename="time">
0.0
</double>
</lst>
<lstname="org.apache.solr.handler.component.StatsComponent">
<doublename="time">
0.0
</double>
</lst>
<lstname="org.apache.solr.handler.component.SpellCheckComponent">
<doublename="time">
0.0
</double>
</lst>
<lstname="org.apache.solr.handler.component.DebugComponent">
<doublename="time">
0.0
</double>
</lst>
</lst>


