gsmiller commented on PR #11741: URL: https://github.com/apache/lucene/pull/11741#issuecomment-1242553881
> I'm not sure if this is true. I've seen users run TermInSetQuerys with 10k terms or more, a typical use-case being implementing some form of join where a first query collects IDs of interest and a second query uses them as a filter.

Right, that's fair. I'm curious about the cases you've seen though. The cost estimate for `TermInSetQuery` is pretty perfectly suited for this use-case right now (assuming the key uniquely identifies a document). Putting cost estimation aside though, I'd be curious when a doc-values approach would be more efficient here. Looking up the 10k terms has to be done for both query approaches, so I'd only expect the doc-values approach to be more efficient if there's a lead iterator with relatively few documents (relative to the number of unique terms in the join). Is that the sort of case you have in mind?

> Terms dictionary lookups tend to be expensive, so looking up these 10k terms is not cheap, and if IndexOrDocValuesQuery decides that the doc-values approach is better because of costs, then Lucene will perform these 10k lookups again in the terms dictionary of the doc values field, which would be wasteful?

+1. I think we're saying roughly the same thing. The problematic case with doing term lookup as part of cost estimation is duplicating that work when deciding to use doc-values. It's too bad we have to effectively do the same work twice in this case. It would be super cool if we could find a way to reuse the term lookup work, but that's obviously very non-trivial since the term dictionaries are completely different implementations, types of fields, etc.

> Not directly related to this discussion, but TermInSetQuery feels more complicated than ranges when it comes to figuring out the best data structure to run the query, and maybe we should fold the logic about whether or not to use doc-values directly into TermInSetQuery instead of expecting users to build an IndexOrDocValuesQuery themselves?
> This would allow more sophisticated logic, such as making different decisions depending on whether we expect terms to be selective or not?

I like this idea a lot actually. I can imagine a fairly different type of heuristic we might want to use as compared to numeric ranges. For instance, the number of terms matters. Whether or not we can identify that the field is a unique key matters. Hmmm... seems worth experimenting with a bit. Thanks for the idea!
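
To make the trade-off above a bit more concrete, here is a rough sketch of what such a heuristic could look like. It is plain Java with no Lucene dependency and is not taken from the PR: the class `TermInSetStrategySketch`, its methods, the parameters (`termCount`, `uniqueKey`, `docFreqUpperBound`, `leadCost`), and the 8x penalty are all hypothetical placeholders. It assumes the postings cost is roughly the number of terms when the field is a unique key, and otherwise falls back to a caller-supplied upper bound on matching documents.

```java
public final class TermInSetStrategySketch {

    /** The two ways of running the query discussed in this thread. */
    public enum Strategy { POSTINGS, DOC_VALUES }

    // Hypothetical penalty applied to the doc-values approach; a placeholder, not a tuned value.
    private static final long DOC_VALUES_PENALTY = 8;

    private TermInSetStrategySketch() {}

    /**
     * Estimates how many documents the postings approach would visit.
     * Assumption: with a unique key, each term matches at most one document,
     * so the number of terms bounds the cost. Otherwise the caller supplies
     * an upper bound on the number of matching documents.
     */
    public static long estimatePostingsCost(long termCount, boolean uniqueKey, long docFreqUpperBound) {
        if (termCount < 0 || docFreqUpperBound < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return uniqueKey ? termCount : docFreqUpperBound;
    }

    /**
     * Picks doc values only when the lead iterator is expected to visit far
     * fewer documents than the postings approach would.
     */
    public static Strategy choose(long termCount, boolean uniqueKey, long docFreqUpperBound, long leadCost) {
        if (leadCost < 0) {
            throw new IllegalArgumentException("leadCost must be non-negative");
        }
        long postingsCost = estimatePostingsCost(termCount, uniqueKey, docFreqUpperBound);
        return leadCost < postingsCost / DOC_VALUES_PENALTY
            ? Strategy.DOC_VALUES
            : Strategy.POSTINGS;
    }
}
```

For a 10k-term join on a unique key, `choose(10_000, true, 0, 50)` would pick doc values because the lead iterator only visits about 50 documents, while a lead cost of 1,000,000 would keep the postings approach. The sketch deliberately ignores the cost of the term lookups themselves, since both approaches have to pay it.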