gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424470476
@rmuir The use case where I've seen this in the wild has to do with allow/deny lists. We have some use cases where we only want to match documents that exist in some allow-list. That allow-list can be quite large (potentially tens of thousands of terms), but many of the terms aren't present in a given index we're searching. We use the bloom filter codec to efficiently drop terms that aren't present. So we have a large number of terms we need to initialize our `TermInSetQuery` with, but only a much smaller number of them actually end up term-seeking, etc., so the sorting appears to dominate when we've profiled.

I've "redacted" a bunch of this flame chart since this was on an internal system, but you can see how long we're spending sorting terms vs. everything else here:

<img width="1488" alt="Screen Shot 2022-10-21 at 4 05 03 PM" src="https://user-images.githubusercontent.com/16479560/217875948-854560b2-59d1-44d7-99fa-cde87ceef4c5.png">

Deny-listing can obviously have the same issue. I believe `TermInSetQuery` was originally created to handle deny-lists in Tinder search, where there can be tens of thousands of "swipe left" profiles that need to be excluded from results?
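
As a rough illustration of the cost asymmetry described above, here is a minimal standalone sketch in plain Java. It does not use Lucene and is not how `TermInSetQuery` or the bloom filter codec is implemented. The class `AllowListSketch`, its methods, and the `mightContain` predicate (standing in for a bloom-filter-style membership check) are all hypothetical names chosen for this example. It only contrasts sorting the entire allow-list up front with sorting the few terms that survive a membership check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names come from Lucene.
class AllowListSketch {

    // Eager approach: sort every term in the allow-list, whether or not
    // the term exists in the index being searched.
    static List<String> sortAll(List<String> allowList) {
        List<String> sorted = new ArrayList<>(allowList);
        Collections.sort(sorted);
        return sorted;
    }

    // Alternative: keep only the terms the membership check accepts
    // (an assumed stand-in for a bloom filter), then sort the survivors.
    static List<String> filterThenSort(List<String> allowList, Predicate<String> mightContain) {
        List<String> survivors = new ArrayList<>();
        for (String term : allowList) {
            if (mightContain.test(term)) {
                survivors.add(term);
            }
        }
        Collections.sort(survivors);
        return survivors;
    }

    public static void main(String[] args) {
        // A large allow-list in arbitrary order (made-up data).
        List<String> allowList = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {
            allowList.add("user" + i);
        }
        Collections.shuffle(allowList, new Random(42));

        // Only a handful of those terms are actually present in "the index".
        Set<String> present = Set.of("user7", "user42", "user31337");

        System.out.println("terms sorted eagerly: " + sortAll(allowList).size());
        System.out.println("terms sorted after filtering: "
                + filterThenSort(allowList, present::contains));
    }
}
```

With tens of thousands of terms and only a few present, the eager path pays for a comparison sort over the whole list, while the filtered path sorts only the survivors. That is the shape of the imbalance the flame chart shows, under the assumptions stated above.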