[ 
https://issues.apache.org/jira/browse/LUCENE-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475122#comment-17475122
 ] 

Julie Tibshirani edited comment on LUCENE-10375 at 1/13/22, 6:13 AM:
---------------------------------------------------------------------

I tried out the idea in this draft PR: https://github.com/apache/lucene/pull/601

To test performance I ran `KnnGraphTester` against the ~1 million doc GloVe 
dataset. I used the default index buffer and force merged to one segment at the 
end. Here are some force merge times before and after the change. The draft PR 
reduces merge time by roughly 20%:

*Baseline*
SM 1 [2022-01-11T20:23:15.901183385Z; Lucene Merge Thread #0]: 981933 msec to merge numeric vectors [1183514 docs]
SM 1 [2022-01-12T03:55:23.780498939Z; Lucene Merge Thread #0]: 997296 msec to merge numeric vectors [1183514 docs]
SM 1 [2022-01-13T00:59:06.759986235Z; Lucene Merge Thread #0]: 1028748 msec to merge numeric vectors [1183514 docs]

*Draft PR*
SM 1 [2022-01-12T01:20:10.329424688Z; Lucene Merge Thread #0]: 779172 msec to merge numeric vectors [1183514 docs]
SM 1 [2022-01-12T03:00:50.438375812Z; Lucene Merge Thread #0]: 787596 msec to merge numeric vectors [1183514 docs]
SM 1 [2022-01-13T02:09:15.642077242Z; Lucene Merge Thread #0]: 795387 msec to merge numeric vectors [1183514 docs]
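
As a quick check on the "roughly 20%" figure, the three runs in each group can 
be averaged and compared. This is only a small sketch: the class and method 
names (`MergeTimeComparison`, `mean`, `relativeReduction`) are made up for 
illustration, and the inputs are the msec values copied from the log lines 
above.

```java
import java.util.Arrays;

public class MergeTimeComparison {
    // Arithmetic mean of a set of merge times in milliseconds.
    static double mean(long[] millis) {
        return Arrays.stream(millis)
                .average()
                .orElseThrow(() -> new IllegalArgumentException("no samples"));
    }

    // Fraction by which the candidate's mean time is lower than the baseline's.
    static double relativeReduction(long[] baseline, long[] candidate) {
        return 1.0 - mean(candidate) / mean(baseline);
    }

    public static void main(String[] args) {
        // Values taken from the "msec to merge numeric vectors" log lines.
        long[] baseline = {981933L, 997296L, 1028748L};
        long[] draftPr = {779172L, 787596L, 795387L};

        System.out.printf("baseline mean: %.0f msec%n", mean(baseline));
        System.out.printf("draft PR mean: %.0f msec%n", mean(draftPr));
        System.out.printf("reduction: %.1f%%%n",
                100.0 * relativeReduction(baseline, draftPr));
    }
}
```

With three runs per group this gives a reduction of a little over 21%, in line 
with the rough estimate, though three samples are too few to say much about 
variance.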

CC [~jpountz] who originally shared this idea with me (thanks!!)


> Speed up HNSW merge by writing combined vector data
> ---------------------------------------------------
>
>                 Key: LUCENE-10375
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10375
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When merging segments together, the HNSW writer creates a VectorValues 
> instance that gives a merged view of all the segments' VectorValues. This 
> merged instance is used when constructing the new HNSW graph. Graph building 
> needs random access, and the merged VectorValues support this by mapping from 
> merged ordinals -> segments and segment ordinals.
> This mapping seems to add overhead. The nightly indexing benchmarks sometimes 
> show substantial time in Arrays.binarySearch (used to map an ordinal to a 
> segment): 
> https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.
> Instead of using a merged VectorValues to create the graph, maybe we could 
> first write all the segment vectors to a file, and use that file to build the 
> graph.
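
To make the overhead described above concrete, here is a minimal sketch of the 
two lookup strategies. It is not Lucene's actual code: the class 
`MergedOrdinalSketch`, the `starts` array, and both methods are hypothetical, 
and it assumes each segment owns a contiguous range of merged ordinals and that 
vectors are stored as fixed-size float arrays. The first method resolves a 
merged ordinal to a segment with `Arrays.binarySearch`, as the description 
mentions; the second shows that once all vectors live in one combined file, 
the lookup reduces to offset arithmetic.

```java
import java.util.Arrays;

public class MergedOrdinalSketch {
    // starts[i] is the first merged ordinal owned by segment i (assumed layout).
    static int segmentFor(int[] starts, int mergedOrd) {
        int idx = Arrays.binarySearch(starts, mergedOrd);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the owning
        // segment is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    // With every vector written to one combined file, random access needs
    // no per-lookup search: the byte offset follows directly from the ordinal.
    static long flatOffset(int mergedOrd, int dimension) {
        return (long) mergedOrd * dimension * Float.BYTES;
    }

    public static void main(String[] args) {
        // Three hypothetical segments starting at merged ordinals 0, 100, 250.
        int[] starts = {0, 100, 250};
        int mergedOrd = 180;

        int segment = segmentFor(starts, mergedOrd);
        int segmentOrd = mergedOrd - starts[segment];
        System.out.println("segment=" + segment + " segmentOrd=" + segmentOrd);

        // Same ordinal, combined-file layout with 100-dimensional vectors.
        System.out.println("flat offset=" + flatOffset(mergedOrd, 100));
    }
}
```

Graph construction performs many random vector reads per inserted node, so even 
a cheap logarithmic search per read can add up. The combined file trades one 
extra sequential write pass for constant-time lookups afterwards.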



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
