jpountz opened a new pull request, #14204: URL: https://github.com/apache/lucene/pull/14204
This is inspired by a paper from Tencent in which the authors describe how they speed up so-called "histogram queries" by sorting the index by timestamp and translating the range of values covered by each histogram bucket into a range of doc IDs. This way, at collection time, they no longer need to look up values and can compute the histogram purely by looking at collected doc IDs.

YU, Muzhi, LIN, Zhaoxiang, SUN, Jinan, et al. TencentCLS: the cloud log service with high query performances. Proceedings of the VLDB Endowment, 2022, vol. 15, no. 12, p. 3472-3482.

Instead of binary-searching the doc ID space to translate histogram buckets into ranges of doc IDs, the new collector manager uses the recently introduced support for sparse indexing.

When playing with the geonames dataset, computing a histogram of the elevation field runs ~2-3x faster with this optimization than with the naive implementation.
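
To make the idea of translating histogram buckets into ranges of doc IDs concrete, here is a minimal sketch in plain Java. It illustrates the binary-search variant described in the paper, not the sparse-index-based approach this PR takes, and it uses no Lucene APIs. The class and method names (`DocIdHistogramSketch`, `bucketBoundaries`, `countByDocId`) are hypothetical. It assumes a single segment sorted by the field, so that `sortedValues[doc]` is the value of doc ID `doc`, fixed-width buckets, and collected doc IDs arriving in increasing order.

```java
// Illustrative sketch only: hypothetical names, no Lucene APIs.
// Assumes sortedValues[doc] is the field value of doc ID `doc` and that the
// array is sorted ascending (i.e. the index is sorted by this field).
final class DocIdHistogramSketch {

  /** Returns the first doc ID whose value is >= target, or the array length if none. */
  static int lowerBound(long[] sortedValues, long target) {
    int lo = 0;
    int hi = sortedValues.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedValues[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /**
   * Translates fixed-width buckets into doc ID boundaries. Bucket i holds the
   * docs in [bounds[i], bounds[i + 1]). Requires a non-empty array; arithmetic
   * overflow near Long.MAX_VALUE is ignored for brevity.
   */
  static int[] bucketBoundaries(long[] sortedValues, long bucketWidth) {
    long firstBucket = Math.floorDiv(sortedValues[0], bucketWidth);
    long lastBucket = Math.floorDiv(sortedValues[sortedValues.length - 1], bucketWidth);
    int numBuckets = (int) (lastBucket - firstBucket + 1);
    int[] bounds = new int[numBuckets + 1];
    for (int i = 0; i <= numBuckets; i++) {
      bounds[i] = lowerBound(sortedValues, (firstBucket + i) * bucketWidth);
    }
    return bounds;
  }

  /** Counts collected docs per bucket using doc IDs only; no value lookups. */
  static long[] countByDocId(int[] bounds, int[] collectedDocsAscending) {
    long[] counts = new long[bounds.length - 1];
    int bucket = 0;
    for (int doc : collectedDocsAscending) {
      // Advance to the bucket whose doc ID range contains this doc.
      while (doc >= bounds[bucket + 1]) {
        bucket++;
      }
      counts[bucket]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    long[] values = {3, 7, 12, 12, 25, 31, 38}; // value of doc 0..6
    int[] bounds = bucketBoundaries(values, 10); // [0, 2, 4, 5, 7]
    long[] counts = countByDocId(bounds, new int[] {0, 2, 3, 5, 6});
    System.out.println(java.util.Arrays.toString(counts)); // prints [1, 2, 0, 2]
  }
}
```

The boundaries are computed once per segment, so the per-document work at collection time reduces to a comparison against the current bucket's upper doc ID. A naive collector would instead read the field value of every collected document and divide it by the bucket width.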