gashutos opened a new issue, #12448:
URL: https://github.com/apache/lucene/issues/12448

   ### Description
   
   ## Problem statement
   Currently in `TopFieldCollector`, we use a PriorityQueue (a min-heap binary implementation) to find the top `K` elements in `asc` or `desc` order. Whenever a newly iterated document is competitive, we insert it into the PriorityQueue, and if the queue is full, we remove the `min` competitive document from the PriorityQueue and replace it with the newly competitive document. The time complexity for this scenario is `O(n log(K))` in the worst case, where `n = current number of competitive documents`.
   Now what if `n` is pretty high and every document we insert into the PriorityQueue takes exactly `O(log(K))`? We will end up spending a lot of time in the PriorityQueue's heapification process itself.
   We have skipping logic to reduce the iterator cost `n` whenever we update the PriorityQueue, but that does not work well in some scenarios.
   
   ### For example
   Consider time series based log/metric data where the sort query is performed on a `timestamp` field. This field is ever increasing (nearly sorted) and has very high cardinality as well! For `desc` order queries the skipping logic does not optimize much, and the PriorityQueue will cost `O(log(K))` for almost every heapification insertion because incoming documents arrive in increasing order.
   
   Imagine the scenario below where we insert logical timestamps in increasing order `1, 2, 3, 4, ....`: inserting 16 hits the worst case, heapifying all the way down to a leaf node, and the same is true for `17, 18, 19, .....` (imagine we need the top `15` hits, so the PQ size is 15).
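   To make that cost concrete, here is a small stand-alone demo (a hypothetical example using plain `java.util.PriorityQueue` with a counting comparator, not Lucene's internal heap): streaming an ascending sequence through a bounded min-heap of size 15 pays a sift down to a leaf on every eviction, even though the input is already in order.

```java
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicLong;

public class HeapifyCost {

    /**
     * Streams values 1..n through a min-heap bounded at size k and returns
     * how many comparisons the heap performed. Illustrative only.
     */
    static long countComparisons(int k, long n) {
        AtomicLong comparisons = new AtomicLong();
        PriorityQueue<Long> pq = new PriorityQueue<>((a, b) -> {
            comparisons.incrementAndGet();
            return Long.compare(a, b);
        });
        for (long ts = 1; ts <= n; ts++) {
            if (pq.size() < k) {
                pq.add(ts);
            } else {
                // Ascending input: every new value beats the current min,
                // so we evict the root and sift a leaf value down every time.
                pq.poll();
                pq.add(ts);
            }
        }
        return comparisons.get();
    }

    public static void main(String[] args) {
        // top-15 over one million ascending "timestamps"
        System.out.println(countComparisons(15, 1_000_000));
    }
}
```

   With `k = 15` the heap has four levels, so every eviction performs several comparisons, and none of them are avoidable in the plain heap even though the input is sorted. That wasted work is what this proposal targets.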
   
   <img width="1182" alt="image" src="https://github.com/apache/lucene/assets/104654647/dcdd1e36-ebc7-46f3-a847-fdec243b0c26">
   
   In the scenario above, the skipping logic does not help find the top `K` in `desc` order either. But we can do better on the PriorityQueue side: since we know incoming documents arrive in `asc` order (nearly sorted), we can skip heapifying them.
   
   ## Solution proposed
   The crux of the proposed solution is: `We don't need to heapify elements if the incoming order matches the desired order`. The idea is to keep incoming competitive documents in a temporary `circular array` before inserting them into the PriorityQueue and triggering heapify.
   
   ### Algorithm
   1. Create a circular array of the same size as the priority queue. The circular array implementation should support adding an element at the `last` position and removing an element from the `first` position.
   2. If an incoming document qualifies after `TopFieldCollector.checkThreshold()`, keep inserting it into the circular array until the order breaks, i.e. the next document is smaller during a descending-order traversal, or vice versa for ascending order. Remove the first element from the circular array if it is full while inserting.
   3. Once the order breaks, insert the entire circular array into the priority queue and empty the circular array.
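   The three steps above can be sketched roughly as follows. This is a hypothetical stand-alone class, not Lucene code: an `ArrayDeque` stands in for the circular array, a plain min-heap stands in for Lucene's hit queue, the class and method names are illustrative, and the `checkThreshold()` competitiveness check is elided.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Sketch of the proposal for a descending-order top-K over (nearly)
 * ascending input: buffer ascending runs and only heapify when a run breaks.
 */
public class BufferedTopK {
    private final int k;
    private final PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap of the current top-K
    private final ArrayDeque<Long> buffer; // stands in for the circular array

    public BufferedTopK(int k) {
        this.k = k;
        this.buffer = new ArrayDeque<>(k);
    }

    /** Step 2: append while the ascending order holds; drop from the front when full. */
    public void collect(long value) {
        Long last = buffer.peekLast();
        if (last != null && value < last) {
            flush(); // step 3: order broke, push the whole buffered run into the heap
        }
        if (buffer.size() == k) {
            // the first element of a full ascending run is dominated by k larger
            // elements in the same run, so it can never be in the top-K
            buffer.pollFirst();
        }
        buffer.addLast(value);
    }

    /** Step 3: drain the buffered run into the heap, then empty the buffer. */
    private void flush() {
        for (long v : buffer) {
            if (heap.size() < k) heap.add(v);
            else if (v > heap.peek()) { heap.poll(); heap.add(v); }
        }
        buffer.clear();
    }

    /** Top-K values in descending order. */
    public List<Long> topK() {
        flush();
        List<Long> out = new ArrayList<>(heap);
        out.sort(Comparator.reverseOrder());
        return out;
    }
}
```

   For an already-ascending stream, every document lands in the buffer with a single comparison against the buffered tail, and the heap is only touched when a run ends; for fully random data, nearly every element flushes immediately, which is the caveat discussed in the next section.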
   
   With this, we will be able to keep millions of hits from going through the heapification process.
   
   POC: [POC code](https://github.com/gashutos/lucene/commit/1822f6a91b8fa11aebe80fe04b124f2c48ae2e6f)
   _The POC above is implemented for reference with a LinkedList instead of a circular array. It also needs some additional handling for cases like duplicate elements, where the doc id is given preference (but those should be minor changes)._
   
   ### Caveat of this implementation
   The proposed solution works pretty well for ever increasing or ever decreasing ordered fields, like time series data. But it could add overhead if the data is completely random: each element incurs an additional comparison to be added to the circular array before being inserted into the actual heap.
   
   We have two ways to implement this:
   1. Enable this implementation by default for all types of BKD-indexed fields.
   2. Enable this implementation behind a `boolean` flag that can be set `true/false` at the collector level, similar to what we have for the `PointBasedOptimization` flag.
   
   Approach 2 sounds very safe, but it adds configuration overhead for users.
   Approach 1 can introduce overhead, but with random data the BKD-points-based skipping logic works pretty well, insertions into the priority queue are few, and hence we do not see much overhead. I tested this with `13` million documents in random order on a numeric field (long), and with approach 1 we did not see any latency regression, since the skipping logic had already skipped many hits.
   We do not have a benchmark for numeric sort in Lucene itself, but I indexed `LongPoint` in random order with 13 million points/documents and saw similar sort latency: `140 ms` baseline versus `147 ms` with the additional circular queue, on `r6g.2xlarge` EC2 instances.
   
   ## Suggestions from community
   I would like to hear suggestion thoughts from community for above proposed 
change and better way to implement this. Specially time series based `desc` 
order usecase is pretty widely used, this can be covering larger pie for our 
users to get more optimized sort queries on such workload.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

