gashutos commented on issue #12448: URL: https://github.com/apache/lucene/issues/12448#issuecomment-1643604606
@jpountz for looking at this. >Lazily heapifying sounds interesting, and thanks for sharing performance numbers when data occurs in random order. Do you also have performance numbers for the case when the index sort is the opposite order compared to the query sort? I'm curious how much this optimization can save in that case since this is what you're trying to optimize. For this, I have inserted `36 million` documents with below schema, keeping `@timestamp` long field in `asc` order almost. ``` { "@timestamp": 898459201, "clientip": "211.11.9.0", "request": "GET /english/index.html HTTP/1.0", "status": 304, "size": 0 } ``` Below were the results, units are in `ms`, when I query `top(K)` in descending order. where k = 5, 100, 1000.... before my changes, top(K) | pq time | total time | -- | -- | -- | -- 5 hits | 4 | 58 | 100 hits | 70 | 130 | 1000 hits | 95 | 197 | after my changes, top(K) | pq time | total time | -- | -- | -- | -- 5 hits | 0 | 47 | 100 hits | 1 | 59 | 1000 hits | 2 | 101 | The higher the value of `k` the higher heapifycation time is going to take. >You might also be interested in checking out this [paper](https://www.vldb.org/pvldb/vol15/p3472-yu.pdf) where Tencent describes optimizations that they made for a similar problem in section 4.5.2: they configure an index sort by ascending timestamp on their data, but still want to be able to perform both queries by ascending timestamp and descending timestamp. To handle the case when the index sort and the query sort are opposite, they query on exponentially growing windows of documents that match the end of the doc ID space. Yeah, I have read this paper :) The main thing which I kind of felt bottleneck there is they use `index sort` in timestamp field which leads higher indexing latencies. That's a trade off there. I like our current implementation without `index sort` and use BKD points to skip non-competitive hits. And if we add this `Leazy heapifycation` as proposed here, that will give good advantage on current implementation itself. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org