[
https://issues.apache.org/jira/browse/LUCENE-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479731#comment-17479731
]
Julie Tibshirani commented on LUCENE-10375:
-------------------------------------------
I tried simplifying by using the same logic for flush and merge:
https://github.com/apache/lucene/pull/617. I ran some benchmarks on the ~1
million doc GloVe dataset again. This time I used the default in
KnnGraphTester, which is to set a very large index buffer so that we flush a
single large segment.
**Baseline**
IW 0 [2022-01-20T02:53:54.455916827Z; main]: 821552 msec to write vectors
IW 0 [2022-01-20T04:42:49.908232952Z; main]: 833068 msec to write vectors
IW 0 [2022-01-20T06:18:45.631046686Z; main]: 826415 msec to write vectors
**New PR**
IW 0 [2022-01-20T02:30:58.451551481Z; main]: 779223 msec to write vectors
IW 0 [2022-01-20T19:22:40.702436311Z; main]: 774157 msec to write vectors
IW 0 [2022-01-20T20:07:14.082190674Z; main]: 774101 msec to write vectors
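To put the two sets of runs on one scale, the log lines above can be reduced to a mean per configuration and a relative difference. The following is a minimal sketch of that arithmetic only; the class and method names (`FlushTimings`, `mean`, `relativeReduction`) are hypothetical and not part of Lucene or KnnGraphTester, and the numbers are copied from the "msec to write vectors" lines.

```java
import java.util.Arrays;

public class FlushTimings {
    // Timings copied from the IndexWriter log lines above, in milliseconds.
    public static final long[] BASELINE_MS = {821552, 833068, 826415};
    public static final long[] NEW_PR_MS = {779223, 774157, 774101};

    // Arithmetic mean of the given timings.
    public static double mean(long[] values) {
        return Arrays.stream(values).average().orElseThrow();
    }

    // Reduction in mean time, expressed as a fraction of the baseline mean.
    public static double relativeReduction(long[] baseline, long[] candidate) {
        double base = mean(baseline);
        return (base - mean(candidate)) / base;
    }

    public static void main(String[] args) {
        System.out.printf("baseline mean: %.0f ms%n", mean(BASELINE_MS));
        System.out.printf("new PR mean:   %.0f ms%n", mean(NEW_PR_MS));
        System.out.printf("reduction:     %.1f%%%n",
            100.0 * relativeReduction(BASELINE_MS, NEW_PR_MS));
    }
}
```

With only three runs per configuration this is a rough comparison, not a statistical test; it works out to a reduction of about 6% in mean write time.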
It's certainly not slower, and even a little faster: roughly 6% on average
across the three runs. I am a little surprised -- [~jpountz] [[email protected]]
do you have any ideas why this could speed up flush? I'm asking for my own
knowledge, and also to check that I haven't made a mistake in the benchmarks.
> Speed up HNSW merge by writing combined vector data
> ---------------------------------------------------
>
> Key: LUCENE-10375
> URL: https://issues.apache.org/jira/browse/LUCENE-10375
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Priority: Major
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> When merging segments together, the HNSW writer creates a VectorValues
> instance that gives a merged view of all the segments' VectorValues. This
> merged instance is used when constructing the new HNSW graph. Graph building
> needs random access, and the merged VectorValues support this by mapping from
> merged ordinals -> segments and segment ordinals.
> This mapping seems to add overhead. The nightly indexing benchmarks sometimes
> show substantial time in Arrays.binarySearch (used to map an ordinal to a
> segment):
> https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.
> Instead of using a merged VectorValues to create the graph, maybe we could
> first write all the segment vectors to a file, and use that file to build the
> graph.
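
The ordinal mapping described in the issue can be illustrated with a small sketch. This is not Lucene's actual code: the class `MergedOrdinals` and its methods are hypothetical, and it assumes every segment holds at least one vector and that vectors are fixed-width `float` arrays. It shows the `Arrays.binarySearch` lookup that a merged view needs on each random access, next to the plain offset computation that suffices once all vectors sit in a single file.

```java
import java.util.Arrays;

public class MergedOrdinals {
    // starts[i] is the merged ordinal of the first vector in segment i.
    private final int[] starts;

    // Assumption: every entry of segmentSizes is at least 1.
    public MergedOrdinals(int[] segmentSizes) {
        starts = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            starts[i] = next;
            next += segmentSizes[i];
        }
    }

    // Maps a merged ordinal to the index of the segment that holds it.
    public int segmentFor(int mergedOrd) {
        int idx = Arrays.binarySearch(starts, mergedOrd);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the owning
        // segment is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    // Maps a merged ordinal to the ordinal within its own segment.
    public int segmentOrd(int mergedOrd) {
        return mergedOrd - starts[segmentFor(mergedOrd)];
    }

    // With all vectors written back to back in one file, random access needs
    // no search at all: the byte offset follows directly from the ordinal.
    public static long flatOffset(int mergedOrd, int dimension) {
        return (long) mergedOrd * dimension * Float.BYTES;
    }

    public static void main(String[] args) {
        MergedOrdinals m = new MergedOrdinals(new int[] {3, 2, 4});
        int ord = 4;
        System.out.println("segment " + m.segmentFor(ord)
            + ", segment ordinal " + m.segmentOrd(ord)
            + ", flat offset " + flatOffset(ord, 100));
    }
}
```

The binary search costs O(log S) per lookup for S segments, and graph construction performs many such lookups per vector, which is consistent with the profile linked in the issue. The flat layout replaces each lookup with one multiplication.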