gsmiller commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1419161514
@rmuir I grabbed your patch for adding a `ScorerSupplier` to `DocValuesTermsQuery` (#12129) and reran benchmarks. The gap between IndexOrDV and the "self-optimizing" TermInSetQuery has closed with this change.

It looks like I was wrong about the way IndexOrDV plans PK-type queries. I thought it was choosing to use doc values based on what I saw in profiler output, but what I was really seeing was the up-front ordinal lookups in `DocValuesTermsQuery` as a result of not having the `ScorerSupplier` abstraction. With your patch, that goes away.

The only gap that remains now is when the field is _not_ a PK-style field but the terms being used in the disjunction have a low aggregate cost (relative to the other terms in the field; e.g., `Medium Cardinality + Low Cost Country Code Filter Terms`). In this case, IndexOrDV is always using doc values (due to the field-level stats used for cost), but, by doing some term-seeking, we could better decide to use postings.

Here are updated benchmark results: [TiSBenchResults_Simplified_DVSSPatch.md.txt](https://github.com/apache/lucene/files/10663766/TiSBenchResults_Simplified_DVSSPatch.md.txt)

(Note that "low cardinality" cases are kind of terrible still because the TiSQuery is being rewritten to a BooleanQuery.)

> to me the issue is a problem with TermInSetQuery ScorerSupplier cost method

+1. Maybe there's a way to address this remaining gap by being smarter about the cost function without term-seeking? That would be ideal.

I also played around with the idea of a "cost iterator" abstraction on `ScorerSupplier` as a way to allow something like `TermInSetQuery` to provide incremental costs to `IndexOrDocValuesQuery` as it term-seeks. This feels clunky to me, and I'm not proposing it as a "good idea" right now, but I'll share it as another approach.
I was able to get comparable benchmark results with this technique, and it still allows `IndexOrDocValuesQuery` to "own" the decision between postings and doc values: https://github.com/apache/lucene/compare/main...gsmiller:lucene:explore/tis-score-supplier-cost-iterator. Benchmark results for this approach are here: [TiSBenchResults_SSIterator.md.txt](https://github.com/apache/lucene/files/10663947/TiSBenchResults_SSIterator.md.txt). It feels overly complicated though.
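
To make the "cost iterator" idea a bit more concrete, here is a rough, self-contained toy sketch. It is not the code in the branch above and it does not use any Lucene classes: `CostIteratorSketch`, `costIterator`, `choose`, and the `Plan` enum are hypothetical names, the per-term doc frequencies are made-up inputs, and the `dvPenalty` multiplier is an illustrative assumption. The point is only that the caller can stop term-seeking as soon as the running postings cost exceeds its budget, while still owning the postings-vs-doc-values decision.

```java
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

// Toy model only: no Lucene classes are used, and all names here are hypothetical.
final class CostIteratorSketch {

    enum Plan { POSTINGS, DOC_VALUES }

    // Each step models "seeking" one more term and returns the running sum of
    // doc frequencies seen so far (the incremental postings cost).
    static PrimitiveIterator.OfLong costIterator(long[] docFreqs) {
        return new PrimitiveIterator.OfLong() {
            private int next = 0;
            private long runningCost = 0;

            @Override
            public boolean hasNext() {
                return next < docFreqs.length;
            }

            @Override
            public long nextLong() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                runningCost += docFreqs[next++];
                return runningCost;
            }
        };
    }

    // The caller owns the decision: it pulls incremental costs and bails out to
    // doc values as soon as the running cost exceeds leadCost * dvPenalty.
    // dvPenalty is an assumed tuning knob for this sketch.
    static Plan choose(PrimitiveIterator.OfLong costs, long leadCost, long dvPenalty) {
        long budget = Math.multiplyExact(leadCost, dvPenalty);
        while (costs.hasNext()) {
            if (costs.nextLong() > budget) {
                return Plan.DOC_VALUES; // stop early; remaining terms are never sought
            }
        }
        return Plan.POSTINGS; // every term was sought and the total stayed within budget
    }

    public static void main(String[] args) {
        long leadCost = 1_000;
        long[] cheapTerms = {40, 25, 10};              // low aggregate cost
        long[] expensiveTerms = {6_000, 5_000, 4_000}; // high aggregate cost
        System.out.println(choose(costIterator(cheapTerms), leadCost, 8));
        System.out.println(choose(costIterator(expensiveTerms), leadCost, 8));
    }
}
```

In this toy version, a disjunction over low-cost terms ends up on postings even if field-level stats would have suggested doc values, and the expensive case gives up after the second term without seeking the third.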