HoustonPutman opened a new pull request, #13919: URL: https://github.com/apache/lucene/pull/13919
Resolves #13918

### Description

This introduces a `multiSelect(from, to, k[])` method on the `Selector` abstract class and provides optimized implementations for both built-in `Selector` subclasses, `IntroSelector` and `RadixSelector`.

So far it is only used in `ScalarQuantizer`, which uses `IntroSelector` to find confidence intervals (quantiles) over lists of vectors. That code has been refactored to find all quantiles at once: `ScalarQuantizer.fromVectors()` computes 1 confidence interval (2 quantiles), and `ScalarQuantizer.fromVectorsAutoInterval()` computes 2 confidence intervals (4 quantiles).

### multiSelect details

For both `RadixSelector` and `IntroSelector`, the `multiSelect()` code is close to the `select()` code, except that every path that could lead to one of the `k` values is taken. In each path, only the relevant `k` values are checked. If a path has only 1 `k` value, `select()` is called instead of `multiSelect()`.

One difference in `RadixSelector` is that recursive calls are not made in place; they are kept in a list and executed after all path options have been checked, so that we don't have to keep a stack of histograms. This should not be expensive, because the list of recursive-call path options is capped at the number of valid `k` values for that path, which will likely be small most of the time.

I also provided a default, unoptimized `Selector.multiSelect()` implementation, because `Selector` is a public class that people may have written custom subclasses of (a rough sketch is included at the end of this description).

### Benchmarking

Real benchmarking still needs to be done, but using IntelliJ's profiler on `TestScalarQuantizer.testFromVectorsAutoInterval4Bit()`, I have seen a **>20% speed improvement** in the call to `ScalarQuantizer.fromVectorsAutoInterval()` when using 1028 dimensions (310 ms -> 240 ms). The speedup holds when using 1000 vectors instead of 100 (3200 ms -> 2500 ms) and when using 1000 vectors with a smaller dimension of 128 (480 ms -> 370 ms).

I'd love to write benchmarks that fit the way Lucene usually does them. Are there already vector benchmarks I could build off of, or do I need to start from scratch?
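
### API sketch (illustrative)

For illustration only, here is a rough sketch of what the new method and its unoptimized default could look like. The exact signatures and implementation in this PR may differ; the sketch assumes `k` is an `int[]` of target indices inside `[from, to)` and relies on the existing `Selector.select()` contract (the range is partitioned around the selected index).

```java
// Illustrative sketch only -- not the actual code in this PR.
public abstract class Selector {

  /**
   * Existing single-index selection: after this call, the element at index k is the one
   * it would be if [from, to) were fully sorted, and the range is partitioned around it.
   */
  public abstract void select(int from, int to, int k);

  /**
   * Reorders elements so that every index in k holds the element it would hold if
   * [from, to) were fully sorted. This unoptimized default simply falls back to one
   * select() call per requested index, so existing custom subclasses keep working.
   */
  public void multiSelect(int from, int to, int[] k) {
    for (int index : k) {
      select(from, to, index);
    }
  }

  /** Swap values at slots i and j. */
  protected abstract void swap(int i, int j);
}
```

An optimized implementation (as in `IntroSelector` and `RadixSelector` here) instead walks the partitioning once and only recurses into sub-ranges that contain at least one requested `k` value.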
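
### Usage sketch (illustrative)

And a hypothetical usage example, roughly mirroring how a single confidence interval's two quantiles could be found in one pass; the real `ScalarQuantizer` code operates on sampled vector values and differs in detail.

```java
import java.util.Random;
import org.apache.lucene.util.IntroSelector;

// Hypothetical usage sketch (not the actual ScalarQuantizer code): find the two quantile
// indices of one confidence interval with a single multiSelect() call, assuming the
// multiSelect(from, to, k[]) signature described in this PR.
public class MultiSelectQuantiles {
  public static void main(String[] args) {
    float[] values = new float[10_000];
    Random random = new Random(42);
    for (int i = 0; i < values.length; i++) {
      values[i] = random.nextFloat();
    }

    // Indices of the 5th and 95th percentiles, i.e. a 90% confidence interval.
    int lowerIndex = (int) (values.length * 0.05f);
    int upperIndex = (int) (values.length * 0.95f);

    IntroSelector selector =
        new IntroSelector() {
          float pivot;

          @Override
          protected void swap(int i, int j) {
            float tmp = values[i];
            values[i] = values[j];
            values[j] = tmp;
          }

          @Override
          protected void setPivot(int i) {
            pivot = values[i];
          }

          @Override
          protected int comparePivot(int j) {
            return Float.compare(pivot, values[j]);
          }
        };

    // One pass places both requested indices in their sorted positions,
    // instead of running two independent select() calls over the full range.
    selector.multiSelect(0, values.length, new int[] {lowerIndex, upperIndex});

    System.out.println("lower quantile = " + values[lowerIndex]);
    System.out.println("upper quantile = " + values[upperIndex]);
  }
}
```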