gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424470476

   @rmuir The use case where I've seen this in the wild has to do with 
allow/deny lists. We have some use-cases where we only want to match documents 
that exist in some allow-list. That allow-list can be quite large (potentially 
tens of thousands), but many of the terms aren't present in a given index we're 
searching. We use the bloom filter codec to efficiently drop terms not present. 
So we have a large number of terms we need to initialize our `TermInSetQuery` 
with, but a much smaller number of them actually end up term-seeking, etc., so 
the sorting actually appears to dominate when we've profiled.
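
   To make the shape of that cost concrete, here is a minimal stdlib-only sketch (not Lucene code). It assumes every supplied term is sorted up front, and it uses an exact `HashSet` as a hypothetical stand-in for the bloom filter check. The class name, the `dropAbsentTerms` helper, and the 50,000 / 1-in-100 numbers are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AllowListSketch {

    // Hypothetical stand-in for a per-index membership check such as a bloom filter.
    // A real bloom filter can return false positives; an exact set keeps the sketch simple.
    static List<String> dropAbsentTerms(List<String> allowList, Set<String> termsInIndex) {
        List<String> present = new ArrayList<>();
        for (String term : allowList) {
            if (termsInIndex.contains(term)) {
                present.add(term);
            }
        }
        return present;
    }

    public static void main(String[] args) {
        // Hypothetical allow-list of 50,000 IDs; only every 100th one exists in the index.
        List<String> allowList = new ArrayList<>();
        Set<String> termsInIndex = new HashSet<>();
        for (int i = 0; i < 50_000; i++) {
            String id = "id" + i;
            allowList.add(id);
            if (i % 100 == 0) {
                termsInIndex.add(id);
            }
        }

        // Up-front cost: every allow-list term is sorted, present in the index or not.
        String[] sortedAll = allowList.toArray(new String[0]);
        Arrays.sort(sortedAll);

        // Only the surviving terms would go on to do any term-seeking work.
        List<String> present = dropAbsentTerms(allowList, termsInIndex);

        System.out.println("terms sorted up front: " + sortedAll.length);
        System.out.println("terms present in index: " + present.size());
    }
}
```

   With these made-up numbers, 50,000 terms get sorted while only 500 survive the membership check, which is the kind of imbalance the profile shows.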
   
   I've "redacted" a bunch of this flame chart since this was on an internal 
system, but you can see how long we're spending sorting terms vs. everything 
else here:
   <img width="1488" alt="Screen Shot 2022-10-21 at 4 05 03 PM" 
src="https://user-images.githubusercontent.com/16479560/217875948-854560b2-59d1-44d7-99fa-cde87ceef4c5.png">
   
   Deny-listing can obviously have the same issue. I believe `TermInSetQuery` 
was originally created to handle deny-lists in Tinder search where there can be 
tens of thousands of "swipe left" profiles that need to be excluded from 
results?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

