[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

via GitHub Tue, 07 Feb 2023 09:08:42 -0800


gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1421121653


   @rmuir thanks for the feedback and for spending time having a look. I'm going to 
try summarizing where we've landed to make sure we're on the same page. I think 
we both agree on the following, but please let me know if I'm missing anything:
   1. Ideally we want to leverage `IndexOrDocValues` instead of putting custom 
solutions into queries themselves (e.g., we'd prefer not to have query 
implementations decide between postings / doc values). This decoupling allows 
more reuse and separate innovations in the separate queries (e.g., we can 
independently optimize `TermInSetQuery` and `DocValuesTermsQuery`). This also 
provides more reuse across various use-cases, as opposed to lots of different 
query implementations having to re-invent this postings vs. doc values logic.
   2. The only inputs `IndexOrDocValues` has for making a decision are 1) 
"lead cost" and 2) the stated "cost" of the wrapped queries' `ScorerSupplier`s. 
`TermInSetQuery` may significantly over-estimate cost in its current 
implementation, as it guarantees a cost ceiling while only looking at 
field-level statistics.
   3. "Primary key" type cases work well with `IndexOrDocValues` since 
`TermInSetQuery` cost is able to recognize this situation and provide a better 
cost estimate.
   4. The case where `IndexOrDocValues` doesn't do a great job is when the 
indexed terms cover many documents in general, but the terms used in the 
specific query cover few.
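
   To make point 4 concrete, here is a minimal sketch of the kind of decision 
being described. It is not Lucene's actual code: the class, the `choose` method, 
and the 8x penalty factor are all assumptions for illustration. It only shows how 
an over-estimated postings cost can push the choice toward doc values even when 
the query's terms match few documents.

```java
public final class IndexOrDocValuesSketch {
    /** Hypothetical stand-in for the two execution strategies. */
    enum Choice { POSTINGS, DOC_VALUES }

    // Assumed penalty: doc values are chosen only when the lead cost is well
    // below the stated cost of the postings-based query.
    static final long ASSUMED_DV_PENALTY = 8;

    /** Picks a strategy from the only two inputs available: lead cost and stated cost. */
    static Choice choose(long leadCost, long indexQueryCost) {
        long threshold = indexQueryCost / ASSUMED_DV_PENALTY;
        return threshold <= leadCost ? Choice.POSTINGS : Choice.DOC_VALUES;
    }

    public static void main(String[] args) {
        long leadCost = 10_000;        // docs the lead iterator is expected to visit
        long ceilingCost = 5_000_000;  // ceiling derived from field-level statistics
        long actualCost = 2_000;       // docs the query's terms really match

        // With the ceiling, doc values win even though postings would be cheap.
        System.out.println(choose(leadCost, ceilingCost)); // DOC_VALUES
        // With an accurate cost, postings win.
        System.out.println(choose(leadCost, actualCost));  // POSTINGS
    }
}
```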
   
   Given this, I tend to agree that moving away from `IndexOrDocValues` to 
solve only one type of scenario (mentioned in `#4` above) probably isn't the 
right decision. To move forward, I'm going to explore other ways to solve this 
case while relying on the `IndexOrDocValues` abstraction. I can think of a 
couple of ways to approach the problem:
   1. Alter the way we compute `TermInSetQuery#cost`. Instead of providing a 
ceiling, we could look at some sort of "average" cost (e.g., determine average 
number of docs per term and multiply that out by the number of terms in the 
query). We could also term-seek up to some fixed number of terms up-front to 
get more information. This obviously adds cost but may be worth it.
   2. While significantly more complex, if it's not sufficient to use 
field-level statistics and a single cost value, I may further explore the idea 
of a "cost iterator" in `ScorerSupplier`. This still feels too complex to me and 
I don't really want to go in this direction.
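
   As a rough illustration of option 1, the sketch below compares a 
ceiling-style estimate with an "average docs per term" estimate. It is not the 
current `TermInSetQuery` implementation: the class and method names are 
hypothetical, and the inputs are assumed to be field-level statistics (total 
postings across all terms, number of distinct indexed terms) plus the number of 
terms in the query.

```java
public final class TermInSetCostSketch {
    /**
     * Ceiling-style estimate (an assumed formulation): every indexed term has at
     * least one doc, so the query terms can match at most one doc each plus all
     * of the "extra" postings in the field.
     */
    static long ceilingCost(long sumDocFreq, long fieldTermCount, long queryTermCount) {
        return queryTermCount + (sumDocFreq - fieldTermCount);
    }

    /** Average-style estimate: mean docs per indexed term times the number of query terms. */
    static long averageCost(long sumDocFreq, long fieldTermCount, long queryTermCount) {
        if (fieldTermCount <= 0) {
            throw new IllegalArgumentException("fieldTermCount must be positive");
        }
        double avgDocsPerTerm = (double) sumDocFreq / fieldTermCount;
        long estimate = (long) Math.ceil(avgDocsPerTerm * queryTermCount);
        // Never report more than the ceiling allows.
        return Math.min(estimate, ceilingCost(sumDocFreq, fieldTermCount, queryTermCount));
    }

    public static void main(String[] args) {
        // Field whose terms cover many docs overall; the query uses 20 terms.
        System.out.println(ceilingCost(5_000_000, 100_000, 20)); // 4900020
        System.out.println(averageCost(5_000_000, 100_000, 20)); // 1000

        // "Primary key" field: one doc per term, so both estimates agree.
        System.out.println(ceilingCost(1_000_000, 1_000_000, 20)); // 20
        System.out.println(averageCost(1_000_000, 1_000_000, 20)); // 20
    }
}
```

   The average can of course under-estimate when the query happens to contain 
the field's most frequent terms, which is where term-seeking a fixed number of 
terms up-front could help.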


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

