romseygeek opened a new issue, #12483:
URL: https://github.com/apache/lucene/issues/12483

   ### Description
   
   I've run across an interestingly adversarial setup for the new TermsQuery 
implementation.  We have an index that uses a block-join structure, with the 
parent document being a merchant record and the children being individual 
transactions.  A range query filtered by a few specific merchant ids turns out 
to be surprisingly slow.  The range query on its own takes a long time because 
it covers a large portion of the document space but can't use some of our 
shortcut heuristics (e.g. LUCENE-7641, which inverts the search to find docs 
that *don't* match), since the parent docs don't have the timestamp field.
   
   In combination with the filter, though, I would have expected things to 
still be quick: the range query uses IndexOrDocValuesQuery, so a filter that 
narrows the search down to a fraction of the index ought to select the 
doc-by-doc checking path.  However, the cost estimation code in 
AbstractMultiTermQueryConstantScoreWrapper calculates a very large cost if the 
field you're filtering on isn't an ID field.  In this case we have a sort of 
mid-level cardinality where each value represents a few percent of the index, 
which yields a cost estimate of the total number of docs in the index minus 100 
or so.  Explicitly using a boolean disjunction instead of a TermsQuery yields a 
much more accurate cost estimate and correctly selects doc-by-doc range 
checking, giving a much more performant query.
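   
   To make the effect concrete, here is a toy model in plain Java.  It does not 
call Lucene.  The worst-case formula, the 8x doc-values penalty, and all of the 
numbers are my assumptions for illustration and have not been checked against 
the actual AbstractMultiTermQueryConstantScoreWrapper or IndexOrDocValuesQuery 
code.

```java
// Toy model only: the formulas below are assumptions made for illustration,
// not a copy of Lucene's code.
final class TermsCostSketch {

    /** Cost of a boolean disjunction: the sum of each selected term's doc frequency. */
    static long disjunctionCost(long[] docFreqs) {
        long cost = 0;
        for (long df : docFreqs) {
            cost += df;
        }
        return cost;
    }

    /**
     * Assumed worst-case estimate when per-term statistics are not consulted:
     * every indexed term matches at least one doc, so the query terms could
     * match up to queryTermCount + (sumDocFreq - indexedTermCount) docs.
     */
    static long worstCaseCost(long sumDocFreq, long indexedTermCount, long queryTermCount) {
        return queryTermCount + (sumDocFreq - indexedTermCount);
    }

    /**
     * Assumed selection rule: doc values are penalised 8x, so the doc-by-doc
     * path is chosen only when the lead (filter) cost is below rangeCost / 8.
     */
    static boolean usesDocValues(long rangeCost, long leadCost) {
        return leadCost < (rangeCost >>> 3);
    }

    public static void main(String[] args) {
        long sumDocFreq = 10_000_000L;  // docs carrying the merchant id field (made-up)
        long indexedTerms = 50L;        // distinct merchant ids, about 2% of docs each
        long[] selected = {200_000L, 200_000L, 200_000L}; // three merchant ids in the filter
        long rangeCost = 8_000_000L;    // range covers most of the doc space

        long precise = disjunctionCost(selected);
        long pessimistic = worstCaseCost(sumDocFreq, indexedTerms, selected.length);

        System.out.println("disjunction cost: " + precise
            + ", doc values: " + usesDocValues(rangeCost, precise));
        System.out.println("worst-case cost: " + pessimistic
            + ", doc values: " + usesDocValues(rangeCost, pessimistic));
    }
}
```

   With these made-up numbers the disjunction cost stays under the threshold 
and picks the doc-by-doc path, while the worst-case estimate lands close to the 
total doc count and does not.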
   
   We used to automatically rewrite TermsQuery to a simple boolean disjunction 
if there were fewer than 16 terms.  I wonder if the more complex machinery we 
are using now is overkill for these small term sets, and we should just go back 
to this simple rewrite in those cases?
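   
   As a rough sketch of that rule, again in plain Java with hypothetical names 
(this models only the decision, not the actual rewrite, which in Lucene would 
build a BooleanQuery of SHOULD clauses):

```java
import java.util.List;

// Hypothetical names throughout; this models the decision only.
final class SmallTermSetRewriteSketch {

    /** Cut-off taken from the "fewer than 16 terms" rule mentioned above. */
    static final int BOOLEAN_REWRITE_THRESHOLD = 16;

    enum Strategy { BOOLEAN_DISJUNCTION, MULTI_TERM_MACHINERY }

    /** Small term sets go to a plain disjunction; larger ones keep the heavier machinery. */
    static Strategy choose(List<String> terms) {
        return terms.size() < BOOLEAN_REWRITE_THRESHOLD
            ? Strategy.BOOLEAN_DISJUNCTION
            : Strategy.MULTI_TERM_MACHINERY;
    }

    public static void main(String[] args) {
        System.out.println(choose(List.of("m1", "m2", "m3")));
    }
}
```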


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

