[GitHub] [lucene] romseygeek merged pull request #12136: Don't wrap readers when checking for term vector access in test
romseygeek merged PR #12136: URL: https://github.com/apache/lucene/pull/12136 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [lucene] romseygeek closed issue #12115: org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc fails reproducibly
romseygeek closed issue #12115: org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc fails reproducibly URL: https://github.com/apache/lucene/issues/12115
[GitHub] [lucene] rmuir commented on issue #12137: Add compression feature for DocValues format in new Codec
rmuir commented on issue #12137: URL: https://github.com/apache/lucene/issues/12137#issuecomment-1424104929 It doesn't make sense to compress integers with algorithms like these. We can use a better integer compression algorithm (e.g. PFOR) instead.
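rmuir's suggestion of integer-specific compression can be illustrated with a plain frame-of-reference (FOR) sketch. This is not Lucene's actual PFOR codec (the class and method names below are made up for illustration); it only shows the core idea that clustered numeric doc values need just a few bits per value once a common minimum is factored out, which byte-oriented compressors like LZ4/DEFLATE cannot exploit as well.

```java
// Sketch (not Lucene's codec): plain frame-of-reference (FOR) analysis.
// Store the minimum once, then each value as (value - min) using only as many
// bits as the largest offset needs. PFOR additionally patches outliers.
import java.util.Arrays;

public class ForEncodingSketch {
  static int bitsRequired(long maxOffset) {
    return maxOffset == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxOffset);
  }

  /** Returns {min, bitsPerValue} — enough to size the packed representation. */
  static long[] analyze(long[] values) {
    long min = Arrays.stream(values).min().orElseThrow();
    long max = Arrays.stream(values).max().orElseThrow();
    return new long[] {min, bitsRequired(max - min)};
  }

  public static void main(String[] args) {
    // Clustered doc values: 64-bit longs with only ~4 bits of real entropy each.
    long[] values = {1000, 1003, 1001, 1010, 1007};
    long[] meta = analyze(values);
    // 5 values * 4 bits = 20 bits packed, vs 320 bits raw.
    System.out.println("min=" + meta[0] + " bitsPerValue=" + meta[1]);
  }
}
```

With these made-up inputs the five raw 64-bit values pack into 4 bits each, which is the kind of win a general-purpose byte compressor rarely finds.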
[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery
gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424361807

To avoid leaking internal implementation details of the two query implementations, I've taken a different approach that doesn't expose the fact that we're internally representing terms as prefix-coded data. This new approach just avoids duplicate sorting/cloning, but the prefix encoding is still done twice. I did some informal "benchmarking" of the two approaches with a simple unit test (see below), and the sorting/cloning of values is the most impactful bit of work to save. The duplicate prefix encoding didn't really register in the benchmarks (i.e., this approach and the previous approach were essentially the same).

To make it a "fair fight" with the current version on `main`, I fixed an issue in the current implementation where the values are not cloned before being passed to the `SortedSetDocValuesSetQuery` ctor. The current code on `main` is sorting the provided values array in place, which probably isn't the right thing to do (we correctly clone everywhere else we use `SortedSetDocValuesSetQuery`). With the cloning in place, this PR cuts the "benchmark" results roughly in half (from 134ms to 68ms on my MacBook).

Here's my simple "benchmark":

```java
public void testSortPerformance() {
  int len = 5;
  BytesRef[] terms = new BytesRef[len];
  for (int i = 0; i < len; i++) {
    String s = TestUtil.randomSimpleString(random(), 10, 20);
    terms[i] = new BytesRef(s);
  }
  int iters = 300;
  // warmup
  for (int i = 0; i < iters; i++) {
    KeywordField.newSetQuery("foo", terms);
  }
  long minTime = Long.MAX_VALUE;
  for (int i = 0; i < iters; i++) {
    long t0 = System.nanoTime();
    KeywordField.newSetQuery("foo", terms);
    minTime = Math.min(minTime, System.nanoTime() - t0);
  }
  System.err.println("Time: " + minTime / 1_000_000);
}
```

The downside of this approach is that we're duplicating the memory footprint of the prefix-encoded terms. I think that's an OK trade-off for now though to avoid exposing implementation details of `TermInSetQuery`, since there's no way to reduce the visibility of that ctor (we'd have to just tag it `@lucene.internal`).
[GitHub] [lucene] gsmiller commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery
gsmiller commented on code in PR #12135: URL: https://github.com/apache/lucene/pull/12135#discussion_r1101623346

lucene/core/src/java/org/apache/lucene/document/KeywordField.java:

```diff
@@ -168,8 +171,10 @@ public static Query newExactQuery(String field, String value) {
   public static Query newSetQuery(String field, BytesRef... values) {
     Objects.requireNonNull(field, "field must not be null");
     Objects.requireNonNull(values, "values must not be null");
+    SortedSet<BytesRef> sortedValues = new TreeSet<>(Arrays.asList(values));
     return new IndexOrDocValuesQuery(
-        new TermInSetQuery(field, values), new SortedSetDocValuesSetQuery(field, values));
```

Review Comment: If we don't end up merging this change, we should clone `values` before invoking `SortedSetDocValuesSetQuery#new`
[GitHub] [lucene] gsmiller commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery
gsmiller commented on code in PR #12135: URL: https://github.com/apache/lucene/pull/12135#discussion_r1101624944

lucene/core/src/java/org/apache/lucene/index/PrefixCodedTerms.java:

```diff
@@ -39,6 +43,34 @@ public class PrefixCodedTerms implements Accountable {
   private long delGen;
   private int lazyHash;

+  /** Create {@link PrefixCodedTerms} from a single field and array of terms. */
+  public static PrefixCodedTerms ofFieldTerms(String field, BytesRef... terms) {
+    return ofFieldTerms(field, Arrays.asList(terms));
+  }
+
+  /** Create a {@link PrefixCodedTerms} for a single field and collection of terms. */
+  public static PrefixCodedTerms ofFieldTerms(String field, Collection<BytesRef> terms) {
```

Review Comment: I ended up keeping this in this alternate version as a way of "marking" terms as already being sorted.
[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery
rmuir commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424397388 I still don't understand why we are adding all this special logic to PrefixCodedTerms, to avoid a single Arrays.sort? The sorting only happens once in the query; it's not like it takes milliseconds or anything to sort a damn array! Need to understand the use-case here. I suspect it's a severe database-abuse case. I have to push back on these since Lucene is a search engine.
[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery
rmuir commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424408782 btw, latest commit is slightly better as at least it does not expose `PrefixCodedTerms` in a public API. But it brings Java Collections into the picture, which seems not great versus using simple arrays.
[GitHub] [lucene] benwtrent opened a new pull request, #12138: Force merge into a single segment before getting the directory reader
benwtrent opened a new pull request, #12138: URL: https://github.com/apache/lucene/pull/12138 The test assumes a single segment is created (because only one scorer is created from the leaf contexts), but force merging wasn't done before getting the reader. This change force-merges to a single segment before getting the reader.
[GitHub] [lucene] benwtrent merged pull request #12138: Force merge into a single segment before getting the directory reader
benwtrent merged PR #12138: URL: https://github.com/apache/lucene/pull/12138
[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery
gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424470476 @rmuir The use case where I've seen this in the wild has to do with allow/deny lists. We have some use-cases where we only want to match documents that exist in some allow-list. That allow-list can be quite large (potentially tens of thousands of terms), but many of the terms aren't present in a given index we're searching. We use the bloom filter codec to efficiently drop terms not present. So we have a large number of terms we need to initialize our `TermInSetQuery` with, but a much smaller number of them actually end up term-seeking, etc., so the sorting actually appears to dominate when we've profiled. I've "redacted" a bunch of this flame chart since this was on an internal system, but you can see how long we're spending sorting terms vs. everything else here: https://user-images.githubusercontent.com/16479560/217875948-854560b2-59d1-44d7-99fa-cde87ceef4c5.png Deny-listing can obviously have the same issue. I believe `TermInSetQuery` was originally created to handle deny-lists in Tinder search, where there can be tens of thousands of "swipe left" profiles that need to be excluded from results?
[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery
gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424473544

> But it brings java Collections into the picture, which seems not great versus using simple arrays.

I don't love that either. An alternative could be to just require the ctor to be invoked with a pre-sorted array and not mess about with array copying and sorting. It would of course need to be a non-back-compatible change released in 10, but that's OK.
[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.
jpountz commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424577832

I ran this made-up benchmark to try to assess the benefits of the change. It's not representative of a real-world scenario since it disables merging (to reduce noise), but it still indexes a combination of terms plus doc values and includes flush times, so it covers more than just keyword indexing.

```java
public static void main(String[] args) throws IOException {
  Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
  for (int iter = 0; iter < 100; ++iter) {
    IndexWriterConfig cfg =
        new IndexWriterConfig(null)
            .setOpenMode(OpenMode.CREATE)
            .setMergePolicy(NoMergePolicy.INSTANCE)
            .setMaxBufferedDocs(200_000)
            .setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
    long start = System.nanoTime();
    try (IndexWriter w = new IndexWriter(dir, cfg)) {
      Document doc = new Document();
      KeywordField field1 = new KeywordField("field1", new BytesRef(1), Field.Store.NO);
      doc.add(field1);
      KeywordField field2 = new KeywordField("field2", new BytesRef(1), Field.Store.NO);
      doc.add(field2);
      KeywordField field3 = new KeywordField("field3", new BytesRef(1), Field.Store.NO);
      doc.add(field3);
      for (int i = 0; i < 10_000_000; ++i) {
        field1.binaryValue().bytes[0] = (byte) i;
        field2.binaryValue().bytes[0] = (byte) (3 * i);
        field3.binaryValue().bytes[0] = (byte) (5 * i);
        w.addDocument(doc);
      }
    }
    long end = System.nanoTime();
    System.out.println((end - start) / 1_000_000 + " ns per doc");
  }
}
```

Before the change, indexing takes 5.3us per document. After the change it takes 4.3us.
[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424631344

Hi, I fully agree with Robert here. I see no reason why you want to avoid a stupid no-op Arrays.sort(). Yes, if the array is large, the sort takes some time, sorry. But what does this have to do with PrefixCodedTerms? Why is Elasticsearch using this private API?

I would agree if we can internally have some special-case ctor that is package-private (or shielded by some shared-secrets approach if it needs to work cross-package), which allows us to "reuse" an already existing PrefixCodedTerms instance internally when building a TermInSetQuery from some case like a Term list in the joins module. But PrefixCodedTerms should never ever be part of public APIs. If you want it "correct", use the Java 8 streams API; Adrien does not like it, but it is the "correct way" to do this kind of stuff.

If Elasticsearch needs to do some internal stuff and can't live with a (cheap) re-sort of an array or with using java.util.stream.Stream (if the array is already sorted, TimSort will just linear-scan through it), it should fork the query, as this is a special case and does not need to be supported in Lucene. All APIs are public, so you could make a subclass/fork of TermInSetQuery that relies on some Elasticsearch-internal array container classes that have a defined sort order. It is not Lucene's task to add support in our public APIs for some misuse of Lucene (and those lists of terms for access rights are a misuse of Lucene: it is a search engine and not some US-NASA-for-travel-to-Mars-certified FIPS-0815 foobar high-security-compliant unbreakable datastore with strict user access checks on millions of documents). Tell your customers that this is slower if you apply crappy access rights that should never have been in a search engine!

Keep in mind: in any public API we have to either always sort the input or, alternatively, if we require a pre-sorted array, we have to actually check this. If one passes an array as a public parameter and the definition says it has to be pre-sorted: WE MUST CHECK THIS! If one were to pass an unsorted array we would need to throw an exception, so we need to check correct order as part of param validation. But as checking that the array is in correct order is as expensive as just sorting it, we should simply sort it. There's no discussion possible!
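The validation point above can be sketched in a few lines (a hypothetical helper, with `String` standing in for Lucene's `BytesRef` to keep it self-contained): verifying a "pre-sorted" public argument is itself a full O(n) comparison pass, which is why the argument goes that you may as well just sort — TimSort on already-sorted input is also a single linear scan.

```java
public class SortedCheckSketch {
  /** Linear scan: the cheapest possible "is it pre-sorted?" validation. */
  static boolean isSorted(String[] terms) {
    for (int i = 1; i < terms.length; i++) {
      if (terms[i - 1].compareTo(terms[i]) > 0) {
        return false; // a real ctor would throw IllegalArgumentException here
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isSorted(new String[] {"a", "b", "c"})); // true
    System.out.println(isSorted(new String[] {"b", "a", "c"})); // false
  }
}
```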
[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery
gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424699768

Let me try to clarify a couple things: First, this is not an Elasticsearch use-case. Probably not all that important, but this is for an Amazon Product Search case, where we use Lucene directly. We're not applying access rights, etc. We have use-cases where we need to evaluate allow-lists of products. Yes, we can fork all these queries. But this looks like an opportunity to avoid cloning and sorting an array multiple times. If both of these underlying queries need sorted terms, it seems like it would be nice to avoid it. Yes, a public constructor that requires sorted data should check it. The current proposal is doing this by leveraging the class type of the collection. I also agree with not exposing the prefix-encoded implementation detail in the constructor, which I've also addressed in the latest revision.

It seems like the pushback here amounts to: 1) not wanting to mess with collections instead of arrays, and 2) generally feeling like this is an unnecessary complication for a non-general use-case. I personally disagree on both points. Sure, arrays are better in a number of ways, but this seems like a reasonable use of collections to "mark" sortedness as part of the type contract. While I also recognize that large-cardinality term cases maybe aren't the "norm," we could extend that argument to say we shouldn't bother with `TermInSetQuery` at all; BooleanQuery is just fine with few terms. I could create a slippery-slope argument pretty quickly that says "no more term-in-set at all!" I don't like what's at the bottom of that slope. So I disagree with the pushback (at least in the current revision; I fully understand the pushback around not exposing prefix-encoding details). But I don't disagree so strongly that I'll keep fighting for this if folks are unhappy with it.

I'll do one more pass to see if I can leverage `TimSort`'s efficient implementation on pre-sorted data. If that pans out, maybe it's enough to just sort the terms in `KeywordField` before passing them along. The underlying queries can still do their own sorting, which will just be a no-op. We'll still do some unnecessary array cloning, but maybe it doesn't matter. Anyway, I appreciate the feedback and the discussion!
[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424845306

Sorry, I thought this was Elasticsearch, because I get people asking daily how to hide their documents from users. In the end they are always complaining that their SQL-IN-like queries are slow. This is why I always get angry and try to explain to people that they need to denormalize their data. Instead of having huge lists of documents a user may see, it is better to index a multivalued field "users that can see this document". This always causes problems for those people that want to add new users all the time and would therefore need to reindex everything. My general rule I tell them is: index the user information in your weekly reindex, and for the new users do the heavy "SQL IN" query. But for the users whose visible documents are already reindexed, you can just query for a single term (their username). So you may understand that I get angry about that at some point. Lucene has no way to manage access rights with all the stuff like group access and so on. You always need some mixed approach: denormalize as much as possible, and where that is not possible (or only partially possible) use explicit IN queries. But really avoid that and use the inverted index for what it was invented for.
[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424863546

I disagree with Robert who says "only arrays". I agree with you that we can also allow passing collections. But only if we do it as proposed before: we can have 2 ctors, one with `Collection<BytesRef>` and one with `BytesRef[]`. But both should call a third hidden ctor taking `Stream<BytesRef>`. This internal implementation would call `stream.sorted()` (and possibly also `.distinct()`) and just operate on the stream. If you pass in a `SortedSet` (e.g. `TreeSet`), the sorted and distinct calls will be no-ops. It will not sort the stuff again, IF (and only IF) the comparator of the TreeSet/SortedSet is exactly the one we use. So it is 100% type-safe. I hope Adrien understands that this is the best way to avoid duplicate sorting:

```java
public TermInSetQuery(String field, BytesRef[] terms) {
  this(field, Arrays.stream(terms));
}

public TermInSetQuery(String field, Collection<BytesRef> terms) {
  this(field, terms.stream());
}

private TermInSetQuery(String field, Stream<BytesRef> stream) {
  super(field);
  // and so on
  stream.sorted().distinct().forEachOrdered(term -> {
    // process the terms coming in natural order
  });
}
```

This would be my favorite way to implement this query. I don't think Adrien's mental problem with streams is an argument without a benchmark.
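The no-op claim above rests on the stream characteristics a `TreeSet` reports; a quick self-contained check (stdlib only, nothing Lucene-specific) confirms that a naturally-ordered `TreeSet` advertises SORTED and DISTINCT, which is what lets `sorted()` and `distinct()` skip their work:

```java
import java.util.List;
import java.util.Spliterator;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class StreamSortedSketch {
  public static void main(String[] args) {
    TreeSet<String> terms = new TreeSet<>(List.of("c", "a", "b"));
    // A TreeSet with natural ordering reports SORTED and DISTINCT on its
    // spliterator, so stream().sorted().distinct() need not re-sort/re-dedupe.
    Spliterator<String> s = terms.spliterator();
    System.out.println(s.hasCharacteristics(Spliterator.SORTED));   // true
    System.out.println(s.hasCharacteristics(Spliterator.DISTINCT)); // true
    System.out.println(
        terms.stream().sorted().distinct().collect(Collectors.toList())); // [a, b, c]
  }
}
```

Note the caveat from the comment above: a `TreeSet` built with a custom comparator reports its own comparator rather than natural ordering, so the implementation would still need to verify which comparator is in play.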
[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.
rmuir commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424969915

> I hesitated doing the same with `StringField` but wondered if this could be breaking to some users who might pull a `TokenStream` themselves.

Maybe we can somehow deprecate using a tokenstream there in 9.x; pulling a tokenstream is very expert and it doesn't seem like StringField needs to support that?
[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.
rmuir commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424973701 High level, I don't think it's a big problem, but we are adding some type-guessing, with a lot of runtime checks, versus the user somehow having some type safety via the .document package. Similar to what got fixed recently in stored fields (#12116). Just worth thinking about: is there any way this can be more type-safe to the user in the API?
[GitHub] [lucene] dnhatn opened a new issue, #12140: LRUQueryCache disabled for indices with more than 34 segments
dnhatn opened a new issue, #12140: URL: https://github.com/apache/lucene/issues/12140

### Description

An Elasticsearch customer reported a search performance issue. We looked into the segment stats and found that the index has 34 * 5GB segments, and LRUQueryCache never caches these segments. The reason is that LRUQueryCache only caches segments that have [more than 3%](https://github.com/apache/lucene/blob/7baa01b3c2f93e6b172e986aac8ef577a87ebceb/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L139-L148) of the total number of documents in the index, and all segments here have less than 3%. I will work on a fix for this issue. Any suggestions are welcome.

### Version and environment details

_No response_
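The arithmetic behind this report can be checked with a tiny sketch. This simplifies the linked code down to the ratio condition the issue describes (the helper name and segment sizes below are made up, not the actual LRUQueryCache implementation): with 34 equally-sized segments, each holds 1/34 ≈ 2.94% of the index, just under the 3% floor, so no segment ever qualifies for caching.

```java
public class CacheRatioSketch {
  /** Simplified cacheability test: segment must exceed minSizeRatio of the index. */
  static boolean cacheable(long segmentDocs, long totalDocs, float minSizeRatio) {
    return (float) segmentDocs / totalDocs > minSizeRatio;
  }

  public static void main(String[] args) {
    long docsPerSegment = 1_000_000; // 34 equally-sized segments (sizes made up)
    long totalDocs = 34L * docsPerSegment;
    // 1/34 ≈ 2.94% < 3%: nothing qualifies once an index reaches 34 equal segments.
    System.out.println(cacheable(docsPerSegment, totalDocs, 0.03f)); // false
    // With 20 equal segments, each is 5% of the index and would qualify.
    System.out.println(cacheable(docsPerSegment, 20L * docsPerSegment, 0.03f)); // true
  }
}
```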