[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

via GitHub Thu, 09 Feb 2023 08:28:55 -0800


gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424470476

@rmuir The use case where I've seen this in the wild has to do with
allow/deny lists. We have some use cases where we only want to match documents
that exist in some allow-list. That allow-list can be quite large (potentially
tens of thousands of terms), but many of the terms aren't present in a given
index we're searching. We use the bloom filter codec to efficiently drop terms
not present. So we have a large number of terms we need to initialize our
`TermInSetQuery` with, but a much smaller number of them actually reach term
seeking and the later steps, so the sorting appears to dominate when we profile.
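
To make that cost profile concrete, here is a minimal sketch in plain Java. It is not Lucene code: `AllowListSketch`, `sortOnce`, and `termsToSeek` are hypothetical names, and a `Set` stands in for the bloom filter check. It only illustrates that the sort is paid for every term in the allow-list, while the later per-term work happens only for the terms that survive the filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class AllowListSketch {
  /** Sorts and de-duplicates the whole allow-list once; this cost is paid for every term. */
  public static String[] sortOnce(List<String> allowList) {
    return allowList.stream().distinct().sorted().toArray(String[]::new);
  }

  /**
   * Keeps only the terms that may exist in the index, preserving sorted order.
   * The Set is an assumption standing in for a bloom filter lookup.
   */
  public static List<String> termsToSeek(String[] sortedTerms, Set<String> presentInIndex) {
    List<String> survivors = new ArrayList<>();
    for (String term : sortedTerms) {
      if (presentInIndex.contains(term)) {
        survivors.add(term);
      }
    }
    return survivors;
  }
}
```

With a large allow-list and a small `presentInIndex`, `sortOnce` touches every term while `termsToSeek` returns only a handful, so sorting the same terms a second time would double the dominant cost.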

I've "redacted" a bunch of this flame chart since this was on an internal
system, but you can see how long we're spending sorting terms vs. everything
else here:
<img width="1488" alt="Screen Shot 2022-10-21 at 4 05 03 PM"
src="https://user-images.githubusercontent.com/16479560/217875948-854560b2-59d1-44d7-99fa-cde87ceef4c5.png">

Deny-listing can obviously have the same issue. I believe `TermInSetQuery`
was originally created to handle deny-lists in Tinder search where there can be
tens of thousands of "swipe left" profiles that need to be excluded from
results?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

