gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1419161514

   @rmuir I grabbed your patch for adding a `ScorerSupplier` to 
`DocValuesTermsQuery` (#12129) and reran benchmarks. The gap between IndexOrDV 
and the "self-optimizing" TermInSetQuery has closed with this change. It looks 
like I was wrong about the way IndexOrDV plans PK-type queries. Based on what I 
saw in profiler output, I thought it was choosing to use doc values, but what 
I was really seeing was the up-front ordinal lookups in `DocValuesTermsQuery` 
as a result of not having the `ScorerSupplier` abstraction. With your patch, 
that goes away.
   
   The only gap that remains now is when the field is _not_ a PK-style field 
but the terms being used in the disjunction have a low aggregate cost (relative 
to the other terms in the field; e.g., `Medium Cardinality + Low Cost Country 
Code Filter Terms`). In this case, IndexOrDV always uses doc values (because 
the cost is estimated from field-level stats), but with some term-seeking we 
could make a better decision and use postings. 
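
   To make the remaining gap concrete, here is a minimal, self-contained sketch (plain Java, no Lucene dependency). All names in it (`IndexOrDvCostSketch`, `fieldLevelBound`, `termLevelCost`, `choose`) and all numbers are hypothetical. The 8x doc-values penalty is modeled loosely on the heuristic in `IndexOrDocValuesQuery`, and the field-level bound is just one plausible estimate, not necessarily the formula `TermInSetQuery` uses.

```java
import java.util.Arrays;

// Toy model only: every name and number here is an illustrative assumption.
final class IndexOrDvCostSketch {
  enum Plan { POSTINGS, DOC_VALUES }

  // Assumed planning rule: doc values get an 8x penalty, so postings win
  // whenever indexCost / 8 does not exceed the lead iterator's cost.
  static Plan choose(long indexCost, long leadCost) {
    long threshold = indexCost >>> 3;
    return threshold <= leadCost ? Plan.POSTINGS : Plan.DOC_VALUES;
  }

  // Upper bound from field-level stats alone: every term NOT in the query
  // accounts for at least one posting, the rest might belong to query terms.
  static long fieldLevelBound(long sumDocFreq, long fieldTermCount, long queryTermCount) {
    long otherTerms = Math.max(0, fieldTermCount - queryTermCount);
    return sumDocFreq - otherTerms;
  }

  // Cost obtained by seeking each query term and summing its docFreq.
  static long termLevelCost(long[] docFreqs) {
    return Arrays.stream(docFreqs).sum();
  }

  public static void main(String[] args) {
    long leadCost = 5_000;                 // cost of the leading clause
    long sumDocFreq = 1_000_000;           // postings in the country-code field
    long fieldTermCount = 200;             // distinct country codes
    long[] queryDocFreqs = new long[10];   // ten rare country codes
    Arrays.fill(queryDocFreqs, 50);

    long coarse = fieldLevelBound(sumDocFreq, fieldTermCount, queryDocFreqs.length);
    long exact = termLevelCost(queryDocFreqs);

    System.out.println("field-level estimate " + coarse + " -> " + choose(coarse, leadCost));
    System.out.println("term-level cost " + exact + " -> " + choose(exact, leadCost));
  }
}
```

   With these made-up numbers the field-level estimate is close to the whole field, so the planner picks doc values, while the summed per-term cost is small enough that postings would be chosen.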
   
   Here are updated benchmark results: 
[TiSBenchResults_Simplified_DVSSPatch.md.txt](https://github.com/apache/lucene/files/10663766/TiSBenchResults_Simplified_DVSSPatch.md.txt)
   (Note that "low cardinality" cases are kind of terrible still because the 
TiSQuery is being rewritten to a BooleanQuery)
   
   > to me the issue is a problem with TermInSetQuery ScorerSupplier cost method
   
   +1. Maybe there's a way to address this remaining gap by being smarter about 
the cost function without term-seeking? That would be ideal.
   
   I also played around with the idea of a "cost iterator" abstraction on 
`ScorerSupplier` as a way to allow something like `TermInSetQuery` to provide 
incremental costs to `IndexOrDocValuesQuery` as it term-seeks. This feels 
clunky to me, and I'm not proposing it as a "good idea" right now, but I'll 
share it as another approach. I was able to get comparable benchmark results 
with this technique, and it still allows `IndexOrDocValuesQuery` to "own" the 
decision between postings and doc values: 
https://github.com/apache/lucene/compare/main...gsmiller:lucene:explore/tis-score-supplier-cost-iterator.
 Benchmark results for this approach are here: 
[TiSBenchResults_SSIterator.md.txt](https://github.com/apache/lucene/files/10663947/TiSBenchResults_SSIterator.md.txt).
 It feels overly complicated though.
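
   As a rough illustration of the "cost iterator" idea (not the code in the linked branch, whose actual API may differ), the sketch below uses a hypothetical `CostIterator` interface that reports a running cost after each simulated term seek. The planner stops seeking as soon as the running cost is high enough to rule out postings. The map standing in for the terms dictionary, the 8x penalty, and all names are assumptions for illustration.

```java
import java.util.List;
import java.util.Map;

// Toy model only: interface, class names, and the stopping rule are hypothetical.
final class CostIteratorSketch {
  enum Plan { POSTINGS, DOC_VALUES }

  /** Yields a growing lower bound on the postings cost, one term seek at a time. */
  interface CostIterator {
    boolean hasNext();
    long next(); // performs one more term seek and returns the running cost
  }

  static final class TermSeekingCostIterator implements CostIterator {
    private final Map<String, Long> docFreqByTerm; // stand-in for a terms dictionary
    private final List<String> queryTerms;
    private int pos = 0;
    private long runningCost = 0;

    TermSeekingCostIterator(Map<String, Long> docFreqByTerm, List<String> queryTerms) {
      this.docFreqByTerm = docFreqByTerm;
      this.queryTerms = queryTerms;
    }

    @Override public boolean hasNext() { return pos < queryTerms.size(); }

    @Override public long next() {
      String term = queryTerms.get(pos++);
      runningCost += docFreqByTerm.getOrDefault(term, 0L); // missing term costs 0
      return runningCost;
    }

    int seeks() { return pos; } // number of term seeks performed so far
  }

  // The caller "owns" the decision: it pulls costs until the running total,
  // after the assumed 8x doc-values penalty, exceeds the lead cost.
  static Plan choose(CostIterator costs, long leadCost) {
    while (costs.hasNext()) {
      if ((costs.next() >>> 3) > leadCost) {
        return Plan.DOC_VALUES; // further seeking cannot lower the cost
      }
    }
    return Plan.POSTINGS;
  }

  public static void main(String[] args) {
    Map<String, Long> docFreqs = Map.of("us", 900_000L, "nz", 40L, "is", 10L);
    TermSeekingCostIterator rare = new TermSeekingCostIterator(docFreqs, List.of("nz", "is"));
    TermSeekingCostIterator common = new TermSeekingCostIterator(docFreqs, List.of("us", "nz", "is"));
    System.out.println(choose(rare, 100) + " after " + rare.seeks() + " seeks");
    System.out.println(choose(common, 100) + " after " + common.seeks() + " seeks");
  }
}
```

   The early exit is the point of the design: the seeking work is bounded by how quickly the running cost crosses the threshold, and the postings-versus-doc-values decision stays with the caller.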
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

