gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424361807
To avoid leaking internal implementation details of the two query implementations, I've taken a different approach that doesn't expose the fact that we're internally representing terms as prefix-coded data. This new approach just avoids duplicate sorting/cloning, but the prefix encoding is still done twice. I did some informal "benchmarking" of the two approaches with a simple unit test (see below), and the sorting/cloning of values is the most impactful bit of work to save. The duplicate prefix encoding didn't really register in the benchmarks (i.e., this approach and the previous approach were essentially the same). To make it a "fair fight" with the current version on `main`, I fixed an issue in the current implementation where the values are not cloned before being passed to the `SortedSetDocValuesSetQuery` ctor. The current code on `main` is sorting the provided values array in place, which probably isn't the right thing to do (we correctly clone everywhere else we use `SortedSetDocValuesSetQuery`). With the cloning in place, this PR cuts the "benchmark" results roughly in half (from 134ms to 68ms on my MacBook). Here's my simple "benchmark": ``` public void testSortPerformance() { int len = 50000; BytesRef[] terms = new BytesRef[len]; for (int i = 0; i < len; i++) { String s = TestUtil.randomSimpleString(random(), 10, 20); terms[i] = new BytesRef(s); } int iters = 300; for (int i = 0; i < iters; i++) { KeywordField.newSetQuery("foo", terms); } long minTime = Long.MAX_VALUE; for (int i = 0; i < iters; i++) { long t0 = System.nanoTime(); KeywordField.newSetQuery("foo", terms); minTime = Math.min(minTime, System.nanoTime() - t0); } System.err.println("Time: " + minTime / 1_000_000); } ``` The downside of this approach is that we're duplicating the memory footprint of the prefix encoded terms. I think that's an OK trade-off for now though to avoid exposing implementation details of `TermInSetQuery`, since there's no way to reduce the visibility of that ctor (we'd have to just tag it `@lucene.internal`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org