: not other) setups/intentions. It's counter-intuitive to me that adding
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set
agreed .. but that's where looking at the debug info comes in to understand
it: the reason for that behavior is that your old qf treated part of your
input as garbage and that new field respects it and uses it in the
calculation.
mind you: the "fewer hits" behavior only happens when using a percentage
value in mm ... if you had mm=2 you'd get more results, but you've asked
for "66%" (or whatever) and with that new qf there is a different number
of clauses produced by query parsing.
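to make the arithmetic concrete, here's a rough Python sketch. it is not Solr's actual code: min_should_match is a made-up helper, it only handles positive values, and it assumes a percentage is rounded down to a whole number of clauses.

```python
def min_should_match(optional_clauses, mm):
    """Toy mm arithmetic (hypothetical helper, positive values only)."""
    if isinstance(mm, str) and mm.endswith("%"):
        # assumption: the percentage of the clause count is rounded down
        return optional_clauses * int(mm[:-1]) // 100
    # a plain integer is used as-is, capped at the number of clauses
    return min(int(mm), optional_clauses)


# old qf: 3 clauses; new qf: the extra field lets a 4th clause through
for clauses in (3, 4):
    print(clauses, min_should_match(clauses, "66%"), min_should_match(clauses, 2))
```

with 3 clauses "66%" works out to 1 required clause, with 4 clauses it works out to 2, while mm=2 stays at 2 either way. so the extra clause only raises the bar when mm is a percentage.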
: I wonder if it would be a good idea to have a parameter to (e)dismax
: that told it which of these two behaviors to use? The one where the
: 'term count' is based on the maximum number of terms from any field in
: the 'qf', and one where it's based on the minimum number of terms
: produced from any field in the qf? I am still not sure how feasible
even in your use case, i don't think you are fully considering what that
would produce. imagine that an mmType=min param existed and gave you what
you're asking for. Now imagine that you have two fields, one named
"simple" that strips all punctuation and one named "complex" that doesn't,
and you have a query like this...
q=Foo & Bar
qf=simple complex
mm=100%
mmType=min
* Foo produces tokens for all qf
* & only produces tokens for some qf (complex)
* Bar produces tokens for all qf
your mmType would say "there are only 2 tokens that we can query across
all fields, so our computed minShouldMatch should be 100% of 2 == 2"
sounds good so far right?
the problem is you still have a query clause coming from that "&"
character ... you have 3 real clauses, one of which is that term query for
"complex:&" which means that with your (computed) minShouldMatch of 2 you
would see matches for any doc that happened to have indexed the "&" symbol
in the "complex" field and also matched *either* of Foo or Bar (in either
field)
So while a lot of your results would match both Foo and Bar, you'd
still get a bunch of weird results.
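here's a small Python simulation of that scenario. everything in it is a toy stand-in: the two analyzers, the clause builder, and hypothetical_min_mm (the imagined mmType=min) are assumptions for illustration, not anything dismax actually does.

```python
import re

# Toy query analyzers (assumptions, not real Solr analysis chains):
# "simple" drops punctuation-only tokens, "complex" keeps everything.
def analyze_simple(text):
    return [t.lower() for t in text.split() if re.search(r"\w", t)]

def analyze_complex(text):
    return [t.lower() for t in text.split()]

ANALYZERS = {"simple": analyze_simple, "complex": analyze_complex}

def build_clauses(query):
    """One clause per input chunk, holding every (field, term) that survived."""
    clauses = []
    for chunk in query.split():
        alts = {(f, tok) for f, analyze in ANALYZERS.items() for tok in analyze(chunk)}
        if alts:
            clauses.append(alts)
    return clauses

def hypothetical_min_mm(query, percent):
    # the imagined mmType=min: base the count on the field with the fewest tokens
    fewest = min(len(analyze(query)) for analyze in ANALYZERS.values())
    return fewest * percent // 100

def matches(doc, clauses, mm):
    # doc maps field name -> set of indexed terms
    hits = sum(1 for alts in clauses
               if any(term in doc.get(field, set()) for field, term in alts))
    return hits >= mm

query = "Foo & Bar"
clauses = build_clauses(query)         # 3 real clauses, one is just complex:&
mm = hypothetical_min_mm(query, 100)   # 100% of 2 == 2

# a doc that has "foo" and "&" indexed, but no "bar" anywhere
weird_doc = {"simple": {"foo", "fighters"}, "complex": {"foo", "&", "fighters"}}
only_foo = {"simple": {"foo"}, "complex": {"foo"}}

print(len(clauses), mm, matches(weird_doc, clauses, mm), matches(only_foo, clauses, mm))
```

weird_doc satisfies the Foo clause and the "&" clause, so it clears the computed minimum of 2 without ever containing Bar. that's the kind of weird result i mean.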
: Or maybe a feature where you tell dismax, the number of tokens produced
: by field X, THAT's the one you should use for your 'term count' for mm,
Hmmm.... maybe. i'd have to see a patch in action and play with it, to
really think it through ... hmmm ... honestly i really can't imagine how
that would be helpful in general...
in order to use a feature like that you'd have to really think hard about
the query analysis of your fields, and which ones will produce which
tokens in which situations in order to make sure you pick the *right*
value for that param -- but once you've done that hard thinking you might
as well feed it back into your schema.xml and say "the query analyzer for
field 'complex' should prune any tokens that only contain punctuation"
(instead of saying "'complex' will produce tokens that only contain
punctuation, so let's tell dismax to compute mm based only on 'simple'").
After all, there might not be one single field that you can pick -- maybe
'complex' lets tokens that are all punctuation through but strips
stopwords, and maybe 'simple' does the opposite ... no param value you
pick will help you with that possibility, you really just need to fix the
query analyzers to make sense if you want to use both of those two fields
in the qf.
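a quick Python sketch of that last case, with made-up analyzers and a toy stopword list (all assumptions, nothing here is a real Solr analysis chain): neither field's token count lines up with the number of clauses, but once both analyzers prune the same things the counts agree.

```python
import re

STOPWORDS = {"the", "a", "an", "of"}  # toy list, purely illustrative

def is_punct_only(tok):
    return re.search(r"\w", tok) is None

def simple(text):
    # strips punctuation-only tokens, keeps stopwords
    return [t.lower() for t in text.split() if not is_punct_only(t)]

def complex_(text):
    # keeps punctuation-only tokens, strips stopwords
    return [t.lower() for t in text.split() if t.lower() not in STOPWORDS]

def fixed(text):
    # both fields after the analyzers are fixed: prune punctuation AND stopwords
    return [t for t in simple(text) if t not in STOPWORDS]

def clause_count(query, analyzers):
    # an input chunk becomes a clause if at least one field keeps a token for it
    return sum(1 for chunk in query.split() if any(a(chunk) for a in analyzers))

query = "The Foo & Bar"

tokens_before = [len(a(query)) for a in (simple, complex_)]
clauses_before = clause_count(query, (simple, complex_))

tokens_after = [len(a(query)) for a in (fixed, fixed)]
clauses_after = clause_count(query, (fixed, fixed))

print(tokens_before, clauses_before)
print(tokens_after, clauses_after)
```

before the fix each field yields 3 tokens but the query has 4 clauses, so no single field gives the right number to compute mm from. after the fix every field yields 2 tokens and there are 2 clauses, and a percentage mm means what you expect.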
-Hoss