[GitHub] [lucene] zhaih opened a new issue, #12236: Lazily compute similarity score when reuse the old graph

2023-04-19 Thread via GitHub


zhaih opened a new issue, #12236:
URL: https://github.com/apache/lucene/issues/12236

   ### Description
   
   In #12050 we added the ability to reuse the old graph as an initializer to 
speed up merging, but even when we re-insert the old graph's nodes we still 
need to calculate a similarity score 
[here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L205)
 so that we can pop the worst non-diverse node 
[here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L393)
 based on a sorted sequence.
   But since the score is only used for diversity checking, we may not need it 
at all in some cases (e.g. when we never reach the level's connection limit). 
So we could first insert those nodes without calculating the score; then, when 
we eventually need to pop a worst node, we can calculate the scores, sort the 
neighbor array, and run the normal "find the worst node" procedure.
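   As a rough sketch of the deferred-scoring idea (this is not Lucene's actual 
`HnswGraphBuilder`; `LazyNeighborArray` and the scorer function are 
hypothetical names for illustration), scoring can be postponed until the first 
eviction and cached from then on:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;

class LazyNeighborArray {
  private final int maxConn;                      // level connection limit
  private final IntToDoubleFunction scorer;       // similarity to the owning node
  private final List<Integer> nodes = new ArrayList<>();
  private final List<Double> scores = new ArrayList<>(); // filled lazily
  private boolean scored = false;                 // have we paid for scoring yet?
  private int scoreComputations = 0;              // instrumentation for the sketch

  LazyNeighborArray(int maxConn, IntToDoubleFunction scorer) {
    this.maxConn = maxConn;
    this.scorer = scorer;
  }

  /** Insert a node; no score is computed while we are still under the limit. */
  void add(int node) {
    nodes.add(node);
    if (scored) {
      scores.add(score(node)); // once scored, keep the cache consistent
    }
    if (nodes.size() > maxConn) {
      removeWorst();
    }
  }

  /** On the first overflow, score everything at once; afterwards scores are cached. */
  private void removeWorst() {
    if (!scored) {
      for (int node : nodes) {
        scores.add(score(node));
      }
      scored = true;
    }
    int worst = 0;
    for (int i = 1; i < scores.size(); i++) {
      if (scores.get(i) < scores.get(worst)) {
        worst = i;
      }
    }
    nodes.remove(worst);
    scores.remove(worst);
  }

  private double score(int node) {
    scoreComputations++;
    return scorer.applyAsDouble(node);
  }

  int size() { return nodes.size(); }
  int scoreComputations() { return scoreComputations; }
}
```

   If a neighbor list never overflows its connection limit, no score for its 
re-inserted nodes is ever computed, which is exactly the saving described 
above.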


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] sherman commented on issue #12203: Scalable merge/compaction of big doc values segments.

2023-04-19 Thread via GitHub


sherman commented on issue #12203:
URL: https://github.com/apache/lucene/issues/12203#issuecomment-151513

   Hi, @mikemccand!
   
   I re-ran the tests.
   
   My hardware:
   Model name: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
   96 cores.
   
   The test was quite simple: rewriting all doc values of a specific index 
segment, which is similar to what we do when we run a compaction process.
   
   So, in this test I had a segment with 8,021,709 documents and the following 
statistics of doc values fields (yes, we have a lot of doc values fields):
   
   SORTED_NUMERIC=1649
   SORTED=1
   SORTED_SET=4863
   
   The total size of the .dvd file is 8.3 GB.
   The 
[baseline](https://gist.github.com/sherman/85e8eec254c27247c377736316dc4f57) 
(single thread) took 249 seconds.
   
   The 
[parallel](https://gist.github.com/sherman/f0b066354180baba02f4514104ede881) 
test (32 threads) took 19 seconds.
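   As a rough illustration of the parallel variant (the linked gists contain 
the actual test code; `FieldRewriter` and `rewriteAll` here are hypothetical 
stand-ins), each field's rewrite can be submitted as an independent task on a 
fixed-size pool, since doc values fields can be merged independently:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelFieldRewrite {

  /** Stand-in for the per-field doc values merge work. */
  interface FieldRewriter {
    void rewrite(String field);
  }

  /** Submit each field's rewrite as an independent task on a fixed pool. */
  static void rewriteAll(List<String> fields, FieldRewriter rewriter, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (String field : fields) {
      pool.execute(() -> rewriter.rewrite(field)); // fields are independent, no shared state
    }
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.HOURS); // wait for all per-field tasks
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

   With thousands of fields in the segment, this kind of per-field task split 
keeps all threads busy, which is consistent with the near-linear speedup 
reported above.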
   
   

