gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424361807

   To avoid leaking internal implementation details of the two query 
implementations, I've taken a different approach that doesn't expose the fact 
that we're internally representing terms as prefix-coded data.
   
   This new approach just avoids duplicate sorting/cloning, but the prefix 
encoding is still done twice. I did some informal "benchmarking" of the two 
approaches with a simple unit test (see below), and the sorting/cloning of 
values is the most impactful bit of work to save. The duplicate prefix encoding 
didn't really register in the benchmarks (i.e., this approach and the previous 
approach were essentially the same).
   
   To make it a "fair fight" with the current version on `main`, I fixed an 
issue in the current implementation where the values are not cloned before 
being passed to the `SortedSetDocValuesSetQuery` ctor. The current code on 
`main` is sorting the provided values array in place, which probably isn't the 
right thing to do (we correctly clone everywhere else we use 
`SortedSetDocValuesSetQuery`).
   
   With the cloning in place, this PR cuts the "benchmark" results roughly in 
half (from 134ms to 68ms on my MacBook).
   
   Here's my simple "benchmark":
   ```
     public void testSortPerformance() {
       int len = 50000;
       BytesRef[] terms = new BytesRef[len];
       for (int i = 0; i < len; i++) {
         String s = TestUtil.randomSimpleString(random(), 10, 20);
         terms[i] = new BytesRef(s);
       }
   
       int iters = 300;
       for (int i = 0; i < iters; i++) {
         KeywordField.newSetQuery("foo", terms);
       }
   
       long minTime = Long.MAX_VALUE;
       for (int i = 0; i < iters; i++) {
         long t0 = System.nanoTime();
         KeywordField.newSetQuery("foo", terms);
         minTime = Math.min(minTime, System.nanoTime() - t0);
       }
   
       System.err.println("Time: " + minTime / 1_000_000);
     }
   ```
   
   The downside of this approach is that we're duplicating the memory footprint 
of the prefix encoded terms. I think that's an OK trade-off for now though to 
avoid exposing implementation details of `TermInSetQuery`, since there's no way 
to reduce the visibility of that ctor (we'd have to just tag it 
`@lucene.internal`).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to