gashutos opened a new issue, #12448: URL: https://github.com/apache/lucene/issues/12448
### Description

## Problem statement

Currently in `TopFieldCollector`, we use a PriorityQueue (binary min-heap implementation) to find the top `K` elements in `asc` or `desc` order. Whenever a newly iterated document is competitive, we insert it into the PriorityQueue, and if the queue is full, we remove the `min` competitive document and replace it with the new one. The time complexity for this is `O(n log(K))` in the worst case, where `n = current number of competitive documents`. Now, what if `n` is very high and every document we insert into the PriorityQueue takes exactly `O(log(K))`? In that case we end up spending a lot of time inside the heapification process itself. We have skipping logic to reduce the iteration cost `n` whenever we update the PriorityQueue, but that does not work well in some scenarios.

### For example

For time-series logs/metrics data, sort queries are performed on a `timestamp` field. This field is ever-increasing (nearly sorted) and has very high cardinality as well! `desc` order queries do not benefit much from the skipping logic here, and the PriorityQueue will pay `O(log(K))` for almost every insertion because incoming documents arrive in increasing order. Imagine the scenario below, where we insert logical timestamps in increasing order `1, 2, 3, 4, ....`: inserting 16 is a worst case that sifts 16 all the way down to a leaf node, and the same is true for `17, 18, 19, .....`. (Imagine we need the top `15` hits, so the PQ size is 15.)

<img width="1182" alt="image" src="https://github.com/apache/lucene/assets/104654647/dcdd1e36-ebc7-46f3-a847-fdec243b0c26">

In the scenario above, the skipping logic does not help find the top `K` in `desc` order either. But we can do better on the PriorityQueue side: since we know incoming documents arrive in `asc` order (nearly sorted), we can skip heapifying them.
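To make the worst case concrete, here is a minimal stand-alone sketch (using `java.util.PriorityQueue`, not Lucene's internal `PriorityQueue` class) of bounded top-`K` collection with a min-heap. With strictly ascending input, every new value beats the current minimum, so every insertion evicts the min and pays the full `O(log K)` sift cost:

```java
import java.util.PriorityQueue;

// Illustrative worst case: collecting the top K largest values from an
// ascending stream with a bounded min-heap. Every value after the first K
// is competitive, so every step does a poll + add, each O(log K).
public class TopKDemo {
    public static void main(String[] args) {
        final int K = 15;
        PriorityQueue<Long> minHeap = new PriorityQueue<>(K); // natural (min) order
        for (long timestamp = 1; timestamp <= 1_000_000; timestamp++) {
            if (minHeap.size() < K) {
                minHeap.add(timestamp);
            } else if (timestamp > minHeap.peek()) {
                minHeap.poll();          // evict current min: O(log K)
                minHeap.add(timestamp);  // sift new value: O(log K)
            }
        }
        // For ascending input the top K are simply the last K values,
        // yet the heap did ~2 * (n - K) sift operations to discover that.
        System.out.println(minHeap.peek()); // prints 999986
    }
}
```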
## Solution proposed

The crux of the proposed solution is: **we do not need to heapify elements if they arrive in the desired sort order.** The idea is to keep incoming competitive documents in a temporary `circular array` before inserting them into the PriorityQueue and triggering heapification.

### Algorithm

1. Create a circular array of the same size as the priority queue. We should be able to add an element at the `last` position and remove an element from the `first` position in this circular array implementation.
2. If an incoming document qualifies after `TopFieldCollector.checkThreshold()`, keep inserting documents into the circular array until the order breaks, i.e. the next document is smaller during a descending-order traversal (or vice versa for ascending order). Remove the first element from the circular array if it is full while inserting.
3. Once the order breaks, insert the entire circular array into the priority queue and empty the circular array.

With this, millions of hits can skip the heapification process entirely.

POC : [POC code](https://github.com/gashutos/lucene/commit/1822f6a91b8fa11aebe80fe04b124f2c48ae2e6f)

_The POC above is implemented with a LinkedList instead of a circular array. It also needs some additional work to handle cases like duplicate elements where the docid is given preference (but those should be minor changes)._

### Caveat of this implementation

The proposed solution works very well for ever-increasing or ever-decreasing ordered fields, such as time-series data. But it could add overhead if the data is completely random, since each element incurs an extra comparison to enter the circular array before being inserted into the actual heap. There are two ways to implement this:

1. Enable this implementation by default for all BKD-indexed field types.
2. Enable this implementation behind a `boolean` flag that can be set `true/false` at the collector level, similar to what we have for the `PointBasedOptimization` flag.
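The algorithm above can be sketched as follows. This is a simplified, hypothetical illustration (class and method names are invented, and an `ArrayDeque` stands in for the circular array; it is not the POC code or Lucene's API): ascending values are staged in a bounded deque, and only when the incoming order breaks is the staged run flushed into the min-heap. Dropping the deque's first element when it is full is safe because the deque already holds `K` values greater than or equal to it, so it can never be in the top `K`.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical sketch of the buffering idea for top-K-largest collection.
public class BufferedTopK {
    private final int k;
    private final PriorityQueue<Long> heap;  // min-heap holding top-K candidates
    private final ArrayDeque<Long> buffer;   // stand-in for the circular array

    public BufferedTopK(int k) {
        this.k = k;
        this.heap = new PriorityQueue<>(k);
        this.buffer = new ArrayDeque<>(k);
    }

    public void collect(long value) {
        Long last = buffer.peekLast();
        if (last != null && value < last) {
            flush(); // order broke: heapify the staged run now
        }
        if (buffer.size() == k) {
            // Buffer already holds k values >= its first element,
            // so the first element can never be in the top K.
            buffer.pollFirst();
        }
        buffer.addLast(value); // in-order values skip heapification entirely
    }

    private void flush() {
        heap.addAll(buffer);
        buffer.clear();
        while (heap.size() > k) {
            heap.poll(); // trim back down to the top K
        }
    }

    public long[] topK() {
        flush();
        return heap.stream().mapToLong(Long::longValue).sorted().toArray();
    }

    public static void main(String[] args) {
        BufferedTopK topk = new BufferedTopK(3);
        long[] input = {1, 2, 3, 4, 10, 7, 8, 9, 5, 6};
        for (long v : input) {
            topk.collect(v);
        }
        System.out.println(Arrays.toString(topk.topK())); // prints [8, 9, 10]
    }
}
```

For fully sorted input the heap is touched only on the final flush; for random input every order break triggers a flush, which is the extra-comparison overhead described in the caveat above.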
Approach 2 sounds very safe, but it adds configuration overhead for users. Approach 1 can introduce overhead, but with random data the BKD-points-based skipping logic works very well, so insertions into the priority queue are infrequent and we do not see much overhead. I tested this with `13` million documents in random order on a numeric (long) field, and with approach 1 we did not see any latency regression, since the skipping logic had already skipped many hits. We do not have a benchmark for numeric sort in Lucene itself, but I indexed 13 million LongPoint documents in random order and saw similar sort latency: `140 ms` without versus `147 ms` with the additional circular queue, on `r6g.2xlarge` EC2 instances.

## Suggestions from community

I would like to hear suggestions and thoughts from the community on the change proposed above and better ways to implement it. The time-series `desc` order use case in particular is very widely used, so this could cover a large slice of our users and give them more optimized sort queries on such workloads.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org