gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424361807
To avoid leaking internal implementation details of the two query
implementations, I've taken a different approach that doesn't expose the fact
that we're internally representing terms as prefix-coded data.
This new approach just avoids duplicate sorting/cloning, but the prefix
encoding is still done twice. I did some informal "benchmarking" of the two
approaches with a simple unit test (see below), and the sorting/cloning of
values is the most impactful bit of work to save. The duplicate prefix encoding
didn't really register in the benchmarks (i.e., this approach and the previous
approach were essentially the same).
To make it a "fair fight" with the current version on `main`, I fixed an
issue in the current implementation where the values are not cloned before
being passed to the `SortedSetDocValuesSetQuery` ctor. The current code on
`main` is sorting the provided values array in place, which probably isn't the
right thing to do (we correctly clone everywhere else we use
`SortedSetDocValuesSetQuery`).
With the cloning in place, this PR cuts the "benchmark" results roughly in
half (from 134ms to 68ms on my MacBook).
Here's my simple "benchmark":
```
public void testSortPerformance() {
int len = 50000;
BytesRef[] terms = new BytesRef[len];
for (int i = 0; i < len; i++) {
String s = TestUtil.randomSimpleString(random(), 10, 20);
terms[i] = new BytesRef(s);
}
int iters = 300;
for (int i = 0; i < iters; i++) {
KeywordField.newSetQuery("foo", terms);
}
long minTime = Long.MAX_VALUE;
for (int i = 0; i < iters; i++) {
long t0 = System.nanoTime();
KeywordField.newSetQuery("foo", terms);
minTime = Math.min(minTime, System.nanoTime() - t0);
}
System.err.println("Time: " + minTime / 1_000_000);
}
```
The downside of this approach is that we're duplicating the memory footprint
of the prefix encoded terms. I think that's an OK trade-off for now though to
avoid exposing implementation details of `TermInSetQuery`, since there's no way
to reduce the visibility of that ctor (we'd have to just tag it
`@lucene.internal`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]