[PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

via GitHub Mon, 14 Oct 2024 17:22:01 -0700


HoustonPutman opened a new pull request, #13914:
URL: https://github.com/apache/lucene/pull/13914


   Resolves #13760
   
   ### Description
   
   <!--
   If this is your first contribution to Lucene, please make sure you have 
reviewed the contribution guide.
   https://github.com/apache/lucene/blob/main/CONTRIBUTING.md
   -->
   
   This is using a similar approach to how Solr used to compute multiple 
percentiles at a single time. Basically utilize the quick select method, but 
instead of following a single path, follow a path for each of the `k`s that is 
requested. Multi-quickselect.
   
   That's what I originally made, until I realized that the `DynamicRangeUtil` 
is weighted, so I refactored it to choose by weights instead, and also capture 
the running-value-total and running-weight-total, because that information is 
used in the `DynamicRangeInfo`.
   
   My goal was to add this as a generic capability of the `Selector` (or 
`IntroSelector`) class, but because of the limitations above, it is currently a 
separate class to handle this. If there's any suggestions on how to make this 
generic enough to be put in the generic class, that would be great. But it 
might not be worth the effort if it wouldn't be used anywhere else.
   
   As for the original multi-quickSelect algorithm I mentioned, I looked for 
other multi-select use cases across Lucene, but I only found one instance (The 
quantized does two select calls in succession). If there's more instances we 
can find, I would be happy to add multiSelect as an option on the `Selector` 
class, and implement it in all provided classes.
   
   ### To-Do
   
   - The code needs to be cleaned up and better documented, this is just a POC
   - Benchmarks comparing this to the full-sorting implementation.
   
   ### Caveat
   
   The implement is slightly different, as it will pick the groups according to 
"The first value for which the running weight is <= weight-range-boundary". The 
old logic would start counting again after a weight range was complete, which 
removes information from the overflow of previous weight-ranges. I'm not sure 
either approach is right or wrong, but I wanted to explicitly state how the 
results would be different and why I had to alter a unit test to pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

Reply via email to