gashutos commented on issue #12448: URL: https://github.com/apache/lucene/issues/12448#issuecomment-1644457521
> Thinking a bit more about this optimization, I wonder if it would still work well under concurrent indexing. If I understand the optimization correctly, it relies on the fact that the n-th collected document would generally have a more competitive value than the (n-k)-th collected document to keep inserting into the circular buffer. But this wouldn't be true, e.g. under concurrent indexing if flushing segments that have (k+1) docs or more?
>
> For instance, assume two indexing threads that index 10 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 18 and the second segment could have timestamps 1, 3, 5, ..., 19. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 18, 1, 3, 5, ..., 19. Now if you collect the top-5 hits by descending timestamp, the optimization would automatically disable itself when it has timestamps [10, 12, 14, 16, 18] in the queue and sees timestamp 1, since 1 < 10?

Yes, this won't optimize the scenario where a concurrent flush is invoked with very few documents, say 10 documents per flush. Reading this article on [concurrent flush](https://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html) (it might be an old article without the latest details on concurrent flushing, and I hope this is what you mean by concurrent indexing), it looks like a single flush would still contain thousands of documents when we are indexing millions. This optimization still works very well when the number of documents per flush is high and top(k) is small: if a single flush has 1000 documents and we need the top(100) in descending order, we can still skip roughly 900 documents from going through the heapification process.
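To make the mechanics concrete, here is a minimal self-contained sketch of the idea in plain Java. This is my toy model, not the actual patch: `DelayedHeapTopK`, `collect()` and `topK()` are hypothetical names, a plain `java.util.PriorityQueue` stands in for Lucene's hit queue, and the `checkThreshold()` skipping logic is left out.

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;

/**
 * Toy model of the idea (names are mine, not the actual patch): while hits
 * arrive in ascending order, the latest k values sit in a FIFO window and
 * never touch the heap unless the ascending run is broken.
 */
public class DelayedHeapTopK {
  private final int k;
  // Min-heap of the current top-k candidates (descending sort keeps the k largest).
  private final PriorityQueue<Long> heap;
  // FIFO window of the k most recent in-order values, pending heapification.
  private final ArrayDeque<Long> window;
  private long lastValue = Long.MIN_VALUE;
  long heapified = 0; // how many hits actually went through the heap

  public DelayedHeapTopK(int k) {
    this.k = k;
    this.heap = new PriorityQueue<>(k);
    this.window = new ArrayDeque<>(k);
  }

  public void collect(long value) {
    if (heap.size() < k && window.isEmpty()) {
      // The first k hits fill the heap directly.
      push(value);
    } else if (value >= lastValue) {
      // Ascending run continues. If the window is full, the oldest buffered
      // value now has k larger values behind it, so it can never make the
      // top-k: drop it without ever heapifying it.
      if (window.size() == k) {
        window.pollFirst();
      }
      window.addLast(value);
    } else {
      // The run broke: everything buffered so far must be heapified, and the
      // out-of-order value starts a fresh window.
      flushWindow();
      window.addLast(value);
    }
    lastValue = value;
  }

  /** Call once after the last hit; returns the top-k in descending order. */
  public long[] topK() {
    flushWindow();
    long[] out = new long[heap.size()];
    for (int i = out.length - 1; i >= 0; i--) {
      out[i] = heap.poll(); // min-heap pops ascending, so fill from the back
    }
    return out;
  }

  private void flushWindow() {
    for (long v : window) {
      push(v);
    }
    window.clear();
  }

  private void push(long v) {
    heapified++;
    if (heap.size() < k) {
      heap.offer(v);
    } else if (heap.peek() < v) {
      heap.poll();
      heap.offer(v);
    }
  }
}
```

The key property is that a value evicted from a full window has k newer (hence larger) values behind it, so it can never appear in the final top-k and never needs to be heapified.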
> For instance, assume two indexing threads that index 10 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 18 and the second segment could have timestamps 1, 3, 5, ..., 19. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 18, 1, 3, 5, ..., 19. Now if you collect the top-5 hits by descending timestamp, the optimization would automatically disable itself when it has timestamps [10, 12, 14, 16, 18] in the queue and sees timestamp 1, since 1 < 10?

Let's modify the example you have given slightly:

> assume two indexing threads that index 100 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 198 and the second segment could have timestamps 1, 3, 5, ..., 199. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 198, 1, 3, 5, ..., 199. Now if you collect the top-5 hits by descending timestamp

In this case:

1. Once we have 0, 2, 4, 6, 8 in the priority queue, 10 is inserted into the circular array, and similarly 12, 14, 16 & 18. Those don't go through heapification since we are delaying it.
2. Now 20 comes. We know the order is still maintained, so we insert 20 into the circular array and remove 10 (10 is removed completely; it never goes to the priority queue). The same happens for 22, 24, ..., 198.
3. At this stage we are left with 0, 2, 4, 6, 8 in the priority queue and 190, 192, 194, 196, 198 in the circular array.
4. Now 1 comes and breaks the sequence, so we insert 190, 192, 194, 196, 198 into the priority queue, and since the priority queue size is 5, that evicts 0, 2, 4, 6, 8. 1 is now only present in the circular array.
5. 3, 5, 7, 9 come and are inserted into the circular array since the order is intact. (They wouldn't pass checkThreshold(), but let's assume checkThreshold() is not there.)
6. 11 comes, it is inserted into the circular array and 1 is removed. The same keeps going until we have 191, 193, 195, 197, 199 in the circular array.
7. We are done now; we empty the circular array and insert the remaining 5 elements into the priority queue.

So out of 200 hits, only 15 went through the heapification process (the initial 5, the 5 flushed at the break, and the final 5); the other 185 skipped it entirely. There is a small simulation at the end of this comment that replays these steps.

_Just for the explanation I didn't take the skipping logic into consideration, otherwise in step 4 all the hits from 1 to 189 would be marked non-competitive by the BKD point-based skipping._

Thank you for reading this long explanation, and I hope it clears up your doubt. I know it might not look like a very clean solution, but descending sort order on a timestamp field, especially for logs/metrics scenarios, is very common, and this gives a sizable improvement to those queries.
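PS: in case it helps to sanity-check the arithmetic above, here is a `main()` that can be dropped into the `DelayedHeapTopK` sketch from earlier in this comment. It replays the merged segment from the modified example and counts how many hits actually touch the heap.

```java
public static void main(String[] args) {
  DelayedHeapTopK collector = new DelayedHeapTopK(5);
  // Merged segment from the example: 0, 2, ..., 198 followed by 1, 3, ..., 199.
  for (long t = 0; t <= 198; t += 2) {
    collector.collect(t);
  }
  for (long t = 1; t <= 199; t += 2) {
    collector.collect(t);
  }
  System.out.println(java.util.Arrays.toString(collector.topK()));
  // prints: [199, 198, 197, 196, 195]
  System.out.println(collector.heapified + " of 200 hits were heapified");
  // prints: 15 of 200 hits were heapified
}
```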