romseygeek opened a new issue, #12483: URL: https://github.com/apache/lucene/issues/12483
### Description

I've run across an interestingly adversarial setup for the new TermsQuery implementation. We have an index that uses a block-join structure, with the parent document being a merchant record and the children being individual transactions. Doing a range query filtered by some specific merchant ids turns out to be surprisingly slow.

The range query in particular takes a long time, because it covers a large portion of the document space but can't use some of our shortcut heuristics (e.g. LUCENE-7641, which inverts the search to find docs that *don't* match), since the parent docs don't have the timestamp field. In combination with the filter, though, I would have expected things to still be quick: the range query uses IndexOrDocValuesQuery, so a filter that narrows the search down to a fraction of the index ought to select the doc-by-doc checking path.

However, it turns out that the cost estimation code in AbstractMultiTermQueryConstantScoreWrapper will calculate a very large cost if the field you're filtering on isn't an ID field. In this case we have a sort of mid-level cardinality where each value represents a few percent of the index, which ends up yielding a cost estimate of the total number of docs in the index minus 100 or so. Explicitly using a boolean disjunction instead of a TermsQuery yields a much more accurate cost estimate and correctly selects doc-by-doc range checking, giving a much more performant query.

We used to automatically rewrite TermsQuery to a simple boolean disjunction if there were fewer than 16 terms. I wonder whether the more complex machinery we are using now is overkill for these small term sets, and whether we should just go back to this simple rewrite in those cases?
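To make the effect concrete, here is a rough toy model of the decision described above. It is not Lucene code, and every number and formula in it is an assumption: the index size, the 50 distinct merchant values, the `upperBoundCost` formula (chosen only because it reproduces a "total docs minus a small constant" estimate), and the 8x threshold in `chooseRangePath` (my reading of how IndexOrDocValuesQuery compares its own cost with the lead cost of the conjunction). The class and method names are hypothetical.

```java
/** Toy model of cost-based path selection; a sketch under stated assumptions, not Lucene code. */
public class CostSketch {

    /** Cost of a boolean disjunction of term queries, modelled as the sum of the terms' doc frequencies. */
    static long disjunctionCost(long... docFreqs) {
        long sum = 0;
        for (long df : docFreqs) {
            sum += df;
        }
        return sum;
    }

    /**
     * ASSUMPTION: a pessimistic upper bound that lets one query term absorb every
     * "extra" posting in the field. Hypothetical formula, used for illustration only.
     */
    static long upperBoundCost(long queryTerms, long sumDocFreq, long indexedTerms) {
        return queryTerms + (sumDocFreq - indexedTerms);
    }

    /**
     * ASSUMPTION: the range clause takes the doc-by-doc (doc-values) path only when the
     * cheapest clause of the conjunction costs less than 1/8 of the range query's own cost.
     */
    static String chooseRangePath(long rangeCost, long filterCost) {
        long leadCost = Math.min(rangeCost, filterCost); // cheapest clause leads the conjunction
        long threshold = rangeCost >>> 3;                // 8x penalty applied to doc values
        return threshold <= leadCost ? "index" : "doc-values";
    }

    public static void main(String[] args) {
        long sumDocFreq = 1_000_000L;  // hypothetical: one merchant-id posting per child doc
        long indexedTerms = 50L;       // hypothetical: 50 merchants, about 2% of the index each
        long rangeCost = 800_000L;     // hypothetical: the range covers most of the index

        long termsQueryCost = upperBoundCost(3, sumDocFreq, indexedTerms);
        long booleanCost = disjunctionCost(20_000L, 20_000L, 20_000L);

        System.out.println("TermsQuery-style estimate: " + termsQueryCost
                + " -> " + chooseRangePath(rangeCost, termsQueryCost));
        System.out.println("Disjunction estimate: " + booleanCost
                + " -> " + chooseRangePath(rangeCost, booleanCost));
    }
}
```

Under these assumed numbers the pessimistic estimate is larger than the range query's own cost, so the filter never becomes the lead clause and the range falls back to the index path. The disjunction's estimate is small enough to lead the conjunction and tip the range query into doc-by-doc checking.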