gashutos commented on issue #12448: URL: https://github.com/apache/lucene/issues/12448#issuecomment-1644457521
> Thinking a bit more about this optimization, I wonder if it would still work well under concurrent indexing. If I understand the optimization correctly, it relies on the fact that the n-th collected document would generally have a more competitive value than the (n-k)-th collected document to keep inserting into the circular buffer. But this wouldn't be true, e.g. under concurrent indexing if flushing segments that have (k+1) docs or more?
>
> For instance, assume two indexing threads that index 10 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 18 and the second segment could have timestamps 1, 3, 5, ..., 19. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 18, 1, 3, 5, ..., 19. Now if you collect the top-5 hits by descending timestamp, the optimization would automatically disable itself when it has timestamps [10, 12, 14, 16, 18] in the queue and sees timestamp 1, since 1 < 10?

Yes, this won't optimize the scenario where a concurrent flush is invoked with very few documents, say 10 documents per flush. Reading this article on [concurrent flush](https://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html) (it might be an old article without the latest details on concurrent flushing, and I hope this is what you mean by concurrent indexing), it looks like a single flush would still contain thousands of documents when we are indexing millions. This optimization still works very well when the number of documents per flush is high and top(k) is small: if a single flush has 1000 documents and we need the top(100) in descending order, we can still skip roughly 900 documents from going through the heapification process.
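To make the mechanics concrete, here is a minimal self-contained sketch of the idea in plain Java. This is my toy model, not the actual patch: `DelayedHeapTopK`, `collect()` and `topK()` are hypothetical names, a plain `java.util.PriorityQueue` stands in for Lucene's hit queue, and the `checkThreshold()` skipping logic is left out.

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;

/**
 * Toy model of the idea (names are mine, not the actual patch): while hits
 * arrive in ascending order, the latest k values sit in a FIFO window and
 * never touch the heap unless the ascending run is broken.
 */
public class DelayedHeapTopK {
  private final int k;
  // Min-heap of the current top-k candidates (descending sort keeps the k largest).
  private final PriorityQueue<Long> heap;
  // FIFO window of the k most recent in-order values, pending heapification.
  private final ArrayDeque<Long> window;
  private long lastValue = Long.MIN_VALUE;
  long heapified = 0; // how many hits actually went through the heap

  public DelayedHeapTopK(int k) {
    this.k = k;
    this.heap = new PriorityQueue<>(k);
    this.window = new ArrayDeque<>(k);
  }

  public void collect(long value) {
    if (heap.size() < k && window.isEmpty()) {
      // The first k hits fill the heap directly.
      push(value);
    } else if (value >= lastValue) {
      // Ascending run continues. If the window is full, the oldest buffered
      // value now has k larger values behind it, so it can never make the
      // top-k: drop it without ever heapifying it.
      if (window.size() == k) {
        window.pollFirst();
      }
      window.addLast(value);
    } else {
      // The run broke: everything buffered so far must be heapified, and the
      // out-of-order value starts a fresh window.
      flushWindow();
      window.addLast(value);
    }
    lastValue = value;
  }

  /** Call once after the last hit; returns the top-k in descending order. */
  public long[] topK() {
    flushWindow();
    long[] out = new long[heap.size()];
    for (int i = out.length - 1; i >= 0; i--) {
      out[i] = heap.poll(); // min-heap pops ascending, so fill from the back
    }
    return out;
  }

  private void flushWindow() {
    for (long v : window) {
      push(v);
    }
    window.clear();
  }

  private void push(long v) {
    heapified++;
    if (heap.size() < k) {
      heap.offer(v);
    } else if (heap.peek() < v) {
      heap.poll();
      heap.offer(v);
    }
  }
}
```

The key property is that a value evicted from a full window has k newer (hence larger) values behind it, so it can never appear in the final top-k and never needs to be heapified.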
> For instance, assume two indexing threads that index 10 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 18 and the second segment could have timestamps 1, 3, 5, ..., 19. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 18, 1, 3, 5, ..., 19. Now if you collect the top-5 hits by descending timestamp, the optimization would automatically disable itself when it has timestamps [10, 12, 14, 16, 18] in the queue and sees timestamp 1, since 1 < 10?

Let's modify the example you have given slightly:

> assume two indexing threads that index 100 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 198 and the second segment could have timestamps 1, 3, 5, ..., 199. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 198, 1, 3, 5, ..., 199. Now if you collect the top-5 hits by descending timestamp

In this case:

1. Once we have 0, 2, 4, 6, 8 in the priority queue, 10 is inserted into the circular array, and similarly 12, 14, 16 & 18. Those don't go through heapification since we are delaying it.
2. Now 20 comes. We know the order is still maintained, so we insert 20 into the circular array and remove 10 (10 is removed completely; it never goes to the priority queue). The same happens for 22, 24, ..., 198.
3. At this stage we are left with 0, 2, 4, 6, 8 in the priority queue and 190, 192, 194, 196, 198 in the circular array.
4. Now 1 comes and breaks the sequence, so we insert 190, 192, 194, 196, 198 into the priority queue, and since the priority queue size is 5, that evicts 0, 2, 4, 6, 8. 1 is now only present in the circular array.
5. 3, 5, 7, 9 come and are inserted into the circular array since the order is intact. (They wouldn't pass checkThreshold(), but let's assume checkThreshold() is not there.)
6. 11 comes, it is inserted into the circular array and 1 is removed. The same keeps going until we have 191, 193, 195, 197, 199 in the circular array.
7. We are done now; we empty the circular array and insert the remaining 5 elements into the priority queue.

So out of 200 hits, only 15 went through the heapification process (the initial 5, the 5 flushed at the break, and the final 5); the other 185 skipped it entirely. There is a small simulation at the end of this comment that replays these steps.

_Just for the explanation I didn't take the skipping logic into consideration, otherwise in step 4 all the hits from 1 to 189 would be marked non-competitive by the BKD point-based skipping._

Thank you for reading this long explanation, and I hope it clears up your doubt. I know it might not look like a very clean solution, but descending sort order on a timestamp field, especially for logs/metrics scenarios, is very common, and this gives a sizable improvement to those queries.
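PS: in case it helps to sanity-check the arithmetic above, here is a `main()` that can be dropped into the `DelayedHeapTopK` sketch from earlier in this comment. It replays the merged segment from the modified example and counts how many hits actually touch the heap.

```java
public static void main(String[] args) {
  DelayedHeapTopK collector = new DelayedHeapTopK(5);
  // Merged segment from the example: 0, 2, ..., 198 followed by 1, 3, ..., 199.
  for (long t = 0; t <= 198; t += 2) {
    collector.collect(t);
  }
  for (long t = 1; t <= 199; t += 2) {
    collector.collect(t);
  }
  System.out.println(java.util.Arrays.toString(collector.topK()));
  // prints: [199, 198, 197, 196, 195]
  System.out.println(collector.heapified + " of 200 hits were heapified");
  // prints: 15 of 200 hits were heapified
}
```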