[GitHub] [lucene] gsmiller commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

GitBox Fri, 09 Sep 2022 16:24:47 -0700


gsmiller commented on PR #11741:
URL: https://github.com/apache/lucene/pull/11741#issuecomment-1242553881


   > I'm not sure if this is true. I've seen users run TermInSetQuerys with 10k 
terms or more, a typical use-case being implementing some form of join where a 
first query collects IDs of interest and a second query uses them as a filter.
   
   Right, that's fair. I'm curious about the cases you've seen though. The cost 
estimate for `TermInSetQuery` is already well suited to this use-case 
(assuming the key uniquely identifies a document). Putting cost 
estimation aside though, I'd be curious when a doc-values approach would be 
more efficient here. Looking up the 10k terms has to be done for both query 
approaches, so I'd only expect the doc-values approach to be more efficient if 
there's a lead iterator with relatively few documents (relative to the number 
of unique terms in the join). Is that the sort of case you have in mind?
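
As a rough illustration of the trade-off described above, here is a toy cost model in Java. It is a sketch under stated assumptions, not Lucene code: the class `JoinCostSketch`, its method names, and the unit costs (one unit per term lookup, one per visited document) are hypothetical. Both approaches pay for the term lookups, so the comparison comes down to the lead iterator's cost versus the number of documents the terms match.

```java
/** Toy cost model for illustration only (hypothetical names, not Lucene's API). */
final class JoinCostSketch {
  private JoinCostSketch() {}

  /** Postings approach: look up every term, then iterate the documents they match. */
  static long postingsCost(long termCount, long matchingDocs) {
    return termCount + matchingDocs;
  }

  /** Doc-values approach: look up every term, then check each document from the lead iterator. */
  static long docValuesCost(long termCount, long leadCost) {
    return termCount + leadCost;
  }

  /** The term lookups cancel out, so this reduces to leadCost < matchingDocs. */
  static boolean docValuesLooksCheaper(long termCount, long matchingDocs, long leadCost) {
    return docValuesCost(termCount, leadCost) < postingsCost(termCount, matchingDocs);
  }
}
```

With a unique key and 10k terms (so roughly 10k matching documents), this model favors doc-values only when the lead iterator visits fewer than about 10k documents.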
   
   > Terms dictionary lookups tend to be expensive, so looking up these 10k 
terms is not cheap, and if IndexOrDocValuesQuery decides that the doc-values 
approach is better because of costs, then Lucene will perform these 10k lookups 
again in the terms dictionary of the doc values field, which would be wasteful?
   
   +1. I think we're saying roughly the same thing. The problematic case with 
doing term lookup as part of cost estimation is duplicating that work when 
deciding to use doc-values. It's too bad we have to effectively do the same 
work twice in this case. It would be super cool if we could find a way to reuse 
the term lookup work, but that's clearly non-trivial since the two term 
dictionaries have completely different implementations, field types, etc.
   
   > Not directly related to this discussion, but TermInSetQuery feels more 
complicated than ranges when it comes to figuring out the best data structure 
to run the query, and maybe we should fold the logic about whether or not to 
use doc-values directly into TermInSetQuery instead of expecting users to build 
an IndexOrDocValuesQuery themselves? This would allow more sophisticated logic, 
such as making different decisions depending on whether we expect terms to be 
selective or not?
   
   I like this idea a lot actually. I can imagine a fairly different type of 
heuristic we might want to use as compared to numeric ranges. For instance, the 
number of terms matters. Whether or not we can identify that the field is a 
unique key matters. Hmmm... seems worth experimenting with a bit. Thanks for 
the idea!
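
One possible shape for such a heuristic, sketched in Java. Everything here is an assumption for illustration: the class `TermInSetHeuristicSketch`, its methods, and the decision rule are hypothetical and not part of Lucene or this PR. The inputs are plain numbers; in Lucene, field statistics of this kind could plausibly come from `Terms#size()` and `Terms#getSumDocFreq()`.

```java
/** Hypothetical heuristic sketch; not Lucene's API and not the PR's implementation. */
final class TermInSetHeuristicSketch {
  enum Approach { POSTINGS, DOC_VALUES }

  private TermInSetHeuristicSketch() {}

  /** A field looks like a unique key when every indexed term occurs in exactly one document. */
  static boolean looksLikeUniqueKey(long indexedTermCount, long sumDocFreq) {
    return indexedTermCount > 0 && indexedTermCount == sumDocFreq;
  }

  /**
   * Estimates how many documents the query terms match, assuming docs are spread
   * uniformly across terms. For a unique key this is one document per query term.
   */
  static long estimateMatches(long queryTermCount, long indexedTermCount, long sumDocFreq) {
    if (indexedTermCount <= 0) {
      return sumDocFreq; // term count unknown: fall back to the upper bound
    }
    double avgDocsPerTerm = (double) sumDocFreq / indexedTermCount;
    long estimate = (long) Math.ceil(queryTermCount * avgDocsPerTerm);
    return Math.min(sumDocFreq, estimate);
  }

  /** Prefers doc-values only when the lead iterator is cheaper than the estimated matches. */
  static Approach choose(
      long queryTermCount, long indexedTermCount, long sumDocFreq, long leadCost) {
    long estimated = estimateMatches(queryTermCount, indexedTermCount, sumDocFreq);
    return leadCost < estimated ? Approach.DOC_VALUES : Approach.POSTINGS;
  }
}
```

The number of query terms and the unique-key property both feed the estimate here; a real heuristic would likely also need to account for the cost of the term lookups themselves.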


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

