[GitHub] [lucene] zhaih opened a new issue, #12236: Lazily compute similarity score when reuse the old graph
zhaih opened a new issue, #12236:
URL: https://github.com/apache/lucene/issues/12236

### Description

In #12050 we added the ability to reuse the old graph as an initializer to speed up merging. However, even when we re-insert the old graph's nodes, we still calculate a similarity score [here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L205) so that we can pop the worst non-diverse node [here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L393) based on a sorted sequence.

Since the score is only used for diversity checking, we probably do not need it in some cases (for example, when a node never reaches the connection limit for its level). So we could first insert those nodes without calculating the score; then, when we eventually need to pop the worst node, calculate the scores, sort the neighbor array, and run the normal "find the worst node" procedure.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
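
To make the proposal concrete, here is a minimal Java sketch of the lazy-scoring idea. It is not Lucene's actual `HnswGraphBuilder` or neighbor-array code: the class `LazyNeighbors`, its methods, and the `scorer` callback are hypothetical names made up for illustration, and the diversity check is simplified to "evict the lowest-scoring neighbor". It only shows that scores can be deferred until the connection limit is first exceeded.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch: a bounded neighbor list that defers scoring until an eviction is needed. */
final class LazyNeighbors {
    private final int maxConn;               // connection limit for this level
    private final int[] nodes;               // neighbor ids; one spare slot for the overflow entry
    private final float[] scores;            // valid only for entries [0, scoredCount)
    private final IntToDoubleFunction scorer; // assumed: similarity of a neighbor to the owning node
    private int size = 0;
    private int scoredCount = 0;             // entries [0, scoredCount) are scored and sorted descending
    private int scoreCalls = 0;              // instrumentation for the example only

    LazyNeighbors(int maxConn, IntToDoubleFunction scorer) {
        this.maxConn = maxConn;
        this.nodes = new int[maxConn + 1];
        this.scores = new float[maxConn + 1];
        this.scorer = scorer;
    }

    /** Appends a neighbor without scoring it; scores are computed only if the limit is exceeded. */
    void add(int node) {
        nodes[size++] = node;
        if (size > maxConn) {
            scoreAndSort();
            size--;               // drop the last entry, which now has the lowest score
            scoredCount = size;
        }
    }

    /** Scores every not-yet-scored entry and insertion-sorts it into the descending prefix. */
    private void scoreAndSort() {
        for (int i = scoredCount; i < size; i++) {
            int node = nodes[i];
            float score = (float) scorer.applyAsDouble(node);
            scoreCalls++;
            int j = i;
            while (j > 0 && scores[j - 1] < score) {
                nodes[j] = nodes[j - 1];
                scores[j] = scores[j - 1];
                j--;
            }
            nodes[j] = node;
            scores[j] = score;
        }
        scoredCount = size;
    }

    boolean contains(int node) {
        for (int i = 0; i < size; i++) {
            if (nodes[i] == node) return true;
        }
        return false;
    }

    int size() { return size; }

    int scoreCalls() { return scoreCalls; }

    public static void main(String[] args) {
        // Toy scorer: the similarity is just the node id, so node 1 is the worst neighbor.
        LazyNeighbors n = new LazyNeighbors(3, node -> node);
        n.add(5); n.add(1); n.add(9);
        System.out.println("score calls below the limit: " + n.scoreCalls());
        n.add(7); // exceeds the limit: all four entries are scored, the worst is evicted
        System.out.println("score calls after overflow: " + n.scoreCalls());
        System.out.println("still contains node 1: " + n.contains(1));
    }
}
```

In this sketch a node that never exceeds `maxConn` neighbors never pays for a similarity computation, which is the saving the issue is after. A real implementation would also need to handle readers that expect the neighbor array to be sorted before any eviction happens.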
[GitHub] [lucene] sherman commented on issue #12203: Scalable merge/compaction of big doc values segments.
sherman commented on issue #12203:
URL: https://github.com/apache/lucene/issues/12203#issuecomment-151513

Hi, @mikemccand!

I re-ran the tests.

My hardware: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz, 96 cores.

The test was quite simple: rewrite all doc values of a specific index segment, which is similar to what we do when we run a compaction process.

In this test I had a segment with 8,021,709 documents and the following counts of doc values fields by type (yes, we have a lot of doc values fields):

- SORTED_NUMERIC=1649
- SORTED=1
- SORTED_SET=4863

The total size of the .dvd file is 8.3G.

The [baseline](https://gist.github.com/sherman/85e8eec254c27247c377736316dc4f57) test (single thread) took 249 seconds.
The [parallel](https://gist.github.com/sherman/f0b066354180baba02f4514104ede881) test (32 threads) took 19 seconds.