HoustonPutman opened a new pull request, #13919:
URL: https://github.com/apache/lucene/pull/13919

   Resolves #13918 
   
   ### Description
   
   This introduces a `multiSelect(from, to, k[])` method on the `Selector` abstract class and provides implementations of the method for both existing `Selector` implementations, `IntroSelector` and `RadixSelector`.
   
   So far this is only used in the `ScalarQuantizer`, which uses `IntroSelector` to find confidence intervals (quantiles) for lists of vectors. That code was refactored to find all quantiles at once: `ScalarQuantizer.fromVectors()` computes 1 confidence interval (2 quantiles), and `ScalarQuantizer.fromVectorsAutoInterval()` computes 2 confidence intervals (4 quantiles).
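   
   To illustrate the pattern, here is a hedged sketch of gathering both quantile positions of a single confidence interval with one `multiSelect()` call. The `FloatIntroSelector` subclass and the index math are my own illustration (the real `ScalarQuantizer` computes its indices differently), and it assumes the `multiSelect(from, to, int[] k)` signature described above:
   
```java
import org.apache.lucene.util.IntroSelector;

final class FloatIntroSelector extends IntroSelector {
  private final float[] values;
  private float pivot;

  FloatIntroSelector(float[] values) {
    this.values = values;
  }

  @Override
  protected void swap(int i, int j) {
    float tmp = values[i];
    values[i] = values[j];
    values[j] = tmp;
  }

  @Override
  protected void setPivot(int i) {
    pivot = values[i];
  }

  @Override
  protected int comparePivot(int j) {
    return Float.compare(pivot, values[j]);
  }
}

class QuantileExample {
  /** Reorders {@code values} so both quantile positions hold their sorted values. */
  static float[] upperAndLowerQuantiles(float[] values, float confidenceInterval) {
    int n = values.length;
    // confidenceInterval is expected to be close to 1 (e.g. 0.9f), as in the quantizer.
    int lowerIndex = (int) Math.floor(n * (1 - confidenceInterval) / 2);
    int upperIndex = n - 1 - lowerIndex;
    // One call finds both order statistics instead of two full select() passes.
    new FloatIntroSelector(values).multiSelect(0, n, new int[] {lowerIndex, upperIndex});
    return new float[] {values[lowerIndex], values[upperIndex]};
  }
}
```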
   
   ### multiSelect details
   
   For both `RadixSelector` and `IntroSelector`, the `multiSelect()` code is not too dissimilar from the `select()` code; the difference is that every partition path that could lead to one of the `k` values is followed, and in each path only the relevant `k` values are checked. If a path has only 1 `k` value, then `select()` is called instead of `multiSelect()`.
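   
   As a rough illustration of that control flow (not the actual PR code, and without `IntroSelector`'s worst-case fallback to a different sort), a quickselect-style `multiSelect` over an `int[]` might look like this:
   
```java
class MultiSelectSketch {

  /** Reorders {@code a} so every index in {@code ks[ksFrom, ksTo)} (sorted, within [from, to)) holds its sorted value. */
  static void multiSelect(int[] a, int from, int to, int[] ks, int ksFrom, int ksTo) {
    if (ksTo - ksFrom <= 0 || to - from <= 1) {
      return; // nothing requested, or a single element that is already in place
    }
    if (ksTo - ksFrom == 1) {
      select(a, from, to, ks[ksFrom]); // single k: plain quickselect is enough
      return;
    }
    int p = partition(a, from, to);
    // Split the requested indices around the pivot's final position.
    int leftEnd = ksFrom;
    while (leftEnd < ksTo && ks[leftEnd] < p) {
      leftEnd++;
    }
    int rightStart = leftEnd;
    while (rightStart < ksTo && ks[rightStart] == p) {
      rightStart++; // a k equal to the pivot position is already satisfied
    }
    multiSelect(a, from, p, ks, ksFrom, leftEnd);
    multiSelect(a, p + 1, to, ks, rightStart, ksTo);
  }

  /** Classic quickselect for a single index k in [from, to). */
  static void select(int[] a, int from, int to, int k) {
    while (to - from > 1) {
      int p = partition(a, from, to);
      if (p == k) {
        return;
      } else if (k < p) {
        to = p;
      } else {
        from = p + 1;
      }
    }
  }

  /** Lomuto partition over [from, to); returns the pivot's final index. */
  static int partition(int[] a, int from, int to) {
    int pivot = a[to - 1];
    int i = from;
    for (int j = from; j < to - 1; j++) {
      if (a[j] < pivot) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
        i++;
      }
    }
    int tmp = a[i];
    a[i] = a[to - 1];
    a[to - 1] = tmp;
    return i;
  }
}
```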
   
   One difference in `RadixSelector` is that the recursive calls are not made in place; instead they are collected in a list and executed after all path options have been checked, so that we don't have to keep a stack of histograms. This should not be expensive, because the list of pending recursive calls is capped at the number of valid `k` values for that path, which will likely be small most of the time.
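   
   A hedged sketch of that deferral, again not the actual `RadixSelector` code: while the histogram for the current byte position is being walked, only record which buckets still contain requested indices, and recurse afterwards once the histogram array is no longer needed.
   
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeferredRecursionSketch {

  /** One radix bucket (a sub-range) that still contains at least one requested index. */
  record PendingBucket(int from, int to, int[] ks) {}

  /**
   * Walks the buckets of one histogram level and returns the buckets that need further
   * selection, instead of recursing while the histogram is still in use.
   */
  static List<PendingBucket> pendingBuckets(int[] histogram, int from, int[] ks) {
    List<PendingBucket> pending = new ArrayList<>();
    int bucketFrom = from;
    int ksIndex = 0;
    for (int count : histogram) {
      int bucketTo = bucketFrom + count;
      int ksStart = ksIndex;
      // ks is sorted, so the requested indices inside this bucket form a contiguous slice.
      while (ksIndex < ks.length && ks[ksIndex] < bucketTo) {
        ksIndex++;
      }
      if (ksIndex > ksStart) {
        pending.add(new PendingBucket(bucketFrom, bucketTo, Arrays.copyOfRange(ks, ksStart, ksIndex)));
      }
      bucketFrom = bucketTo;
    }
    // At most ks.length entries, so the deferred list stays small.
    return pending;
  }
}
```
   
   The caller would then loop over the returned buckets and either recurse on the next byte position or, for a bucket holding a single `k`, fall back to `select()`.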
   
   I also provided a default, unoptimized `Selector.multiSelect()` implementation, since `Selector` is a public class that people may have written custom implementations of.
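   
   For reference, here is one way such a fallback could look (my sketch, not necessarily the exact default in this PR). It relies on `select()` partitioning the range around `k`, as its javadoc describes, so later (larger) requested indices only need to search the remaining right-hand slice:
   
```java
import java.util.Arrays;

public abstract class Selector {

  /** Existing contract: the element at k ends up in sorted position, with the range partitioned around it. */
  public abstract void select(int from, int to, int k);

  protected abstract void swap(int i, int j);

  /** Unoptimized fallback: one select() per requested index, shrinking the range as we go. */
  public void multiSelect(int from, int to, int[] k) {
    int[] ks = k.clone();
    Arrays.sort(ks);
    for (int ki : ks) {
      if (ki < from) {
        continue; // duplicate index, already in place
      }
      select(from, to, ki);
      // Everything in [from, ki] is now <= the elements after ki, so larger
      // requested indices can be found in (ki, to) alone.
      from = ki + 1;
    }
  }
}
```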
   
   ### Benchmarking
   
   Real benchmarking still needs to be done, but using IntelliJ's profiler on `TestScalarQuantizer.testFromVectorsAutoInterval4Bit()`, I have seen a **>20% speed improvement** in the call to `ScalarQuantizer.fromVectorsAutoInterval()` when using 1028 dimensions (310 ms -> 240 ms). The speedup holds when using 1000 vectors instead of 100 (3200 ms -> 2500 ms), and when using 1000 vectors with a smaller dimension of 128 (480 ms -> 370 ms).
   
   I'd love to make benchmarks that fit with the way Lucene does them. Are 
there already vector benchmarks that I could build off of, or do I need to 
start from scratch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

