RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Chris Hostetter Tue, 21 Jun 2011 19:20:36 -0700

: not other) setups/intentions.  It's counter-intuitive to me that adding 
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set


agreed .. but that's where looking at the debug info comes in: it lets you 
see that the reason for that behavior is that your old qf treated part of 
your input as garbage, while the new field respects it and uses it in the 
calculation.

mind you: the "fewer hits" behavior only happens when using a percentage 
value in mm ... if you had mm=2 you'd get more results, but you've asked 
for "66%" (or whatever) and with that new qf there is a different number 
of clauses produced by query parsing.
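
to make the clause counting concrete, here is a rough sketch in Python. it is 
not Solr code: the two toy analyzers, the field names "simple" and "complex", 
and the helper names are all made up for illustration, and the rounding-down 
and capping of mm are assumptions about how the value gets computed. it only 
shows why a percentage mm moves when a new qf field starts producing a token 
for the "&".

```python
# Hypothetical sketch, not Solr code. Analyzers and helper names are invented.

def simple_tokens(text):
    # toy "simple" analyzer: strip punctuation, drop tokens that end up empty
    stripped = ("".join(ch for ch in t if ch.isalnum()) for t in text.split())
    return [t for t in stripped if t]

def complex_tokens(text):
    # toy "complex" analyzer: whitespace split, punctuation kept
    return text.split()

ANALYZERS = {"simple": simple_tokens, "complex": complex_tokens}

def count_clauses(q, qf):
    # one clause per chunk of input that yields a token in at least one qf field
    return sum(1 for chunk in q.split()
               if any(ANALYZERS[f](chunk) for f in qf))

def min_should_match(mm, clause_count):
    # assumption: a percentage is rounded down, an absolute value is capped
    if mm.endswith("%"):
        return clause_count * int(mm[:-1]) // 100
    return min(int(mm), clause_count)

for qf in (["simple"], ["simple", "complex"]):
    n = count_clauses("Foo & Bar", qf)
    print(qf, "clauses:", n,
          "mm=100% ->", min_should_match("100%", n),
          "mm=2 ->", min_should_match("2", n))
```

with qf=simple there are 2 clauses, with qf=simple complex there are 3, so 
mm=100% asks for 2 and then 3 matching clauses, while mm=2 stays at 2 in both 
cases.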

: I wonder if it would be a good idea to have a parameter to (e)dismax 
: that told it which of these two behaviors to use? The one where the 
: 'term count' is based on the maximum number of terms from any field in 
: the 'qf', and one where it's based on the minimum number of terms 
: produced from any field in the qf?  I am still not sure how feasible 

even in your use case, i don't think you are fully considering what that 
would produce.  imagine that an mmType=min param existed and gave you what 
you're asking for.  Now imagine that you have two fields, one named 
"simple" that strips all punctuation and one named "complex" that doesn't, 
and you have a query like this...

        q=Foo & Bar
        qf=simple complex
        mm=100%
        mmType=min

  * Foo produces tokens for all qf
  * & only produces tokens for some qf (complex)
  * Bar produces tokens for all qf

your mmType would say "there are only 2 tokens that we can query across 
all fields, so our computed minShouldMatch should be 100% of 2 == 2"

sounds good so far right?

the problem is you still have a query clause coming from that "&" 
character ... you have 3 real clauses, one of which is that term query for 
"complex:&" which means that with your (computed) minShouldMatch of 2 you 
would see matches for any doc that happened to have indexed the "&" symbol 
in the "complex" field and also matched *either* of Foo or Bar (in either 
field)
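
a toy simulation of that scenario, as a sketch only: mmType=min is the 
hypothetical param from this thread (it does not exist), and the analyzers, 
the matching logic, and the sample docs are invented for illustration. each 
clause is treated as "any of these field:token pairs", and a doc matches when 
at least mm clauses hit.

```python
# Hypothetical sketch: mmType=min does not exist; everything here is a toy model.

def simple_tokens(text):
    # toy "simple" analyzer: strip punctuation, drop tokens that end up empty
    stripped = ("".join(ch for ch in t if ch.isalnum()) for t in text.split())
    return [t for t in stripped if t]

def complex_tokens(text):
    # toy "complex" analyzer: whitespace split, punctuation kept
    return text.split()

ANALYZERS = {"simple": simple_tokens, "complex": complex_tokens}

def build_clauses(q, qf):
    # one clause per input chunk; a clause is a set of (field, token) pairs
    clauses = []
    for chunk in q.split():
        terms = {(f, t) for f in qf for t in ANALYZERS[f](chunk)}
        if terms:
            clauses.append(terms)
    return clauses

def index_doc(text, fields):
    # the set of (field, token) pairs a doc would have indexed
    return {(f, t) for f in fields for t in ANALYZERS[f](text)}

def matches(doc_terms, clauses, mm):
    # a doc matches when at least mm clauses share a term with it
    return sum(1 for c in clauses if c & doc_terms) >= mm

q, qf = "Foo & Bar", ["simple", "complex"]
clauses = build_clauses(q, qf)                       # 3 real clauses
mm_min = min(len(ANALYZERS[f](q)) for f in qf)       # 100% of 2 == 2

docs = {"both": "Foo Bar", "weird": "Foo & Baz", "only_foo": "Foo Baz"}
for name, text in docs.items():
    print(name, matches(index_doc(text, qf), clauses, mm_min))
```

the "weird" doc has no Bar at all, but it matches anyway because the 
"complex:&" clause counts toward the computed mm of 2.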

So while a lot of your results would match both Foo and Bar, you'd still 
get a bunch of weird results.

: Or maybe a feature where you tell dismax, the number of tokens produced 
: by field X, THAT's the one you should use for your 'term count' for mm, 

Hmmm.... maybe.  i'd have to see a patch in action and play with it, to 
really think it through ... hmmm ... honestly i really can't imagine how 
that would be helpful in general...

in order to use a feature like that you'd have to really think hard about 
the query analysis of your fields, and which ones will produce which 
tokens in which situations in order to make sure you pick the *right* 
value for that param -- but once you've done that hard thinking you might 
as well feed it back into your schema.xml and say "the query analyzer for 
field 'complex' should prune any tokens that only contain punctuation" 
(instead of saying "'complex' will produce tokens that only contain 
punctuation, so let's tell dismax to compute mm based only on 'simple'").  
After all, there might not be one single field that you can pick -- maybe 
'complex' lets tokens that are all punctuation through but strips 
stopwords, and maybe 'simple' does the opposite ... no param value you 
pick will help you with that possibility, you really just need to fix the 
query analyzers to make sense if you want to use both of those two fields 
in the qf.
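
as a sketch of what "prune any tokens that only contain punctuation" means at 
query time (plain Python, not a Solr filter or schema.xml config; the function 
names and the toy analyzers are assumptions for illustration):

```python
import string

# Hypothetical sketch of the analysis idea only, not a Solr filter.

def is_all_punctuation(token):
    # True when every character of the token is ASCII punctuation
    return all(ch in string.punctuation for ch in token)

def complex_query_tokens(text):
    # toy "complex" query analyzer: whitespace split, punctuation-only tokens pruned
    return [t for t in text.split() if not is_all_punctuation(t)]

def simple_query_tokens(text):
    # toy "simple" query analyzer: strip punctuation, drop tokens that end up empty
    stripped = ("".join(ch for ch in t if ch.isalnum()) for t in text.split())
    return [t for t in stripped if t]

q = "Foo & Bar"
print(complex_query_tokens(q))  # ['Foo', 'Bar']
print(simple_query_tokens(q))   # ['Foo', 'Bar']
```

once both query analyzers agree on which chunks of input produce tokens, both 
fields yield the same number of clauses and a percentage mm behaves the same 
with either qf.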


-Hoss
