gsmiller commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1421121653
@rmuir thanks for the feedback and for taking the time to have a look. I'm going to try summarizing where we've landed to make sure we're on the same page. I think we both agree on the following, but please let me know if I'm missing anything:

1. Ideally we want to leverage `IndexOrDocValues` instead of putting custom solutions into the queries themselves (e.g., we'd prefer not to have query implementations decide between postings and doc values). This decoupling allows separate innovation in the separate queries (e.g., we can independently optimize `TermInSetQuery` and `DocValuesTermsQuery`). It also provides more reuse across use-cases, as opposed to lots of different query implementations having to re-invent this postings vs. doc values logic.
2. The only inputs `IndexOrDocValues` has for making a decision are 1) the "lead cost" and 2) the stated "cost" of the wrapped queries' `ScorerSupplier`s. `TermInSetQuery` may significantly over-estimate cost in its current implementation, as it guarantees a cost ceiling while only looking at field-level statistics.
3. "Primary key" type cases work well with `IndexOrDocValues`, since the `TermInSetQuery` cost is able to recognize this situation and provide a better cost estimate.
4. The case where `IndexOrDocValues` doesn't do a great job is when the indexed terms cover many documents in general, but the terms used in the specific query cover few.

Given this, I tend to agree that moving away from `IndexOrDocValues` to solve only one type of scenario (`#4` above) probably isn't the right decision.

To move forward, I'm going to explore other ways to solve this case while relying on the `IndexOrDocValues` abstraction. I can think of a couple of ways to approach the problem:

1. Alter the way we compute `TermInSetQuery#cost`. Instead of providing a ceiling, we could look at some sort of "average" cost (e.g., determine the average number of docs per term and multiply that by the number of terms in the query). We could also term-seek up to some fixed number of terms up-front to get more information. This obviously adds cost but may be worth it.
2. While significantly more complex, if it's not sufficient to use field-level statistics and a single cost value, I may further explore the idea of a "cost iterator" in `ScorerSupplier`. This still feels too complex to me and I don't really want to go in this direction.
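
To make the first approach a bit more concrete, here is a rough, standalone sketch of how a ceiling-style estimate and an "average" estimate could diverge when both are computed from the same field-level statistics. It does not use the Lucene API: the class and method names are hypothetical, the ceiling formula is just one plausible way to bound the cost from above, and the 8x threshold in `usePostings` is an assumed stand-in for whatever comparison `IndexOrDocValues` makes between lead cost and query cost.

```java
// Hypothetical sketch; none of these names come from Lucene itself.
final class CostEstimateSketch {

  // Ceiling-style estimate (assumed formulation): every query term matches,
  // and each remaining indexed term accounts for at least one posting.
  static long ceilingCost(long sumDocFreq, long indexedTermCount, long queryTermCount) {
    long remainingTerms = Math.max(0, indexedTermCount - queryTermCount);
    return Math.max(0, sumDocFreq - remainingTerms);
  }

  // "Average" estimate: average docs per indexed term, multiplied by the
  // number of query terms, capped at the total number of postings.
  static long averageCost(long sumDocFreq, long indexedTermCount, long queryTermCount) {
    if (indexedTermCount <= 0) {
      return sumDocFreq; // no term count available, fall back to the total
    }
    double avgDocsPerTerm = (double) sumDocFreq / indexedTermCount;
    long estimate = (long) Math.ceil(avgDocsPerTerm * queryTermCount);
    return Math.min(estimate, sumDocFreq);
  }

  // Assumed decision rule: prefer postings unless the stated cost is more
  // than 8x the lead cost (the 8x factor is an illustrative assumption).
  static boolean usePostings(long statedCost, long leadCost) {
    return (statedCost >>> 3) <= leadCost;
  }

  public static void main(String[] args) {
    long sumDocFreq = 1_000_000; // total postings in the field
    long indexedTerms = 10_000;  // unique terms in the field
    long queryTerms = 10;        // terms in the query
    long leadCost = 500;         // cost of the leading clause

    long ceiling = ceilingCost(sumDocFreq, indexedTerms, queryTerms);
    long average = averageCost(sumDocFreq, indexedTerms, queryTerms);

    System.out.println("ceiling=" + ceiling + " usePostings=" + usePostings(ceiling, leadCost));
    System.out.println("average=" + average + " usePostings=" + usePostings(average, leadCost));
  }
}
```

With these made-up numbers the ceiling estimate is 990,010 and the average estimate is 1,000, so the two estimates land on opposite sides of the assumed threshold. The average is still only a field-level guess and can miss in either direction, which is where term-seeking a fixed number of terms up-front might help.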