[ 
https://issues.apache.org/jira/browse/LUCENE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-9941:
-------------------------------------
    Description: 
This is a continuation of LUCENE-9937, but for HNSW index performance.

Approaches
 * LuceneVectorsOnly: a baseline that only indexes vectors
 * LuceneHnsw: our HNSW implementation, with a force merge to one segment
 * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
 * hnswlib: a C++ HNSW implementation from the author of the paper

Datasets
 * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, comparing 
euclidean distance
 * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, comparing 
cosine similarity

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Index time (sec)
LuceneVectorsOnly              14.93
LuceneHnsw                   3191.16
LuceneHnswNoForceMerge       1194.31
hnswlib                       311.09
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                  Index time (sec)
LuceneVectorsOnly              14.17
LuceneHnsw                   8940.41
LuceneHnswNoForceMerge       3623.68 
hnswlib                       587.23
{code}
We force merge to one segment to emulate a case where vectors aren't 
continually being indexed. In these situations, it seems likely users would 
force merge to optimize search speed: searching a single large graph is 
expected to be faster than searching several small ones serially. To see how 
long the force merge takes, we can subtract LuceneHnswNoForceMerge from 
LuceneHnsw.

The construction parameters match those in LUCENE-9937 and are optimized for 
search recall + QPS instead of index speed, as I figured this would be a common 
set-up.

Some observations:
 * In cases when segments are eventually force merged, we do a lot of extra 
work building intermediate graphs that are eventually merged away. This is a 
difficult problem, and one that's been raised in the past. As a simple step, I 
wonder if we should not build graphs for segments that are below a certain 
size. For sufficiently small segments, it could be a better trade-off to avoid 
building a graph and support nearest-neighbor search through a brute-force scan?
 * Indexing is slow compared to what we're used to for other formats, even if 
we disregard the extra work mentioned above. For sift-128-euclidean, building 
only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min.
 * As a note, graph indexing uses ANN searches in order to add each new vector 
to the graph. So the slower search speed between Lucene and hnswlib may 
contribute to slower indexing.

  was:
This is a continuation of LUCENE-9937, but for HNSW index performance.

Approaches
 * LuceneVectorsOnly: a baseline that only indexes vectors
 * LuceneHnsw: our HNSW implementation, with a force merge to one segment
 * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
 * hnswlib: a C++ HNSW implementation from the author of the paper

Datasets
 * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, euclidean 
distance
 * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, euclidean 
distance

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Index time (sec)
LuceneVectorsOnly              14.93
LuceneHnsw                   3191.16
LuceneHnswNoForceMerge       1194.31
hnswlib                       311.09
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                  Index time (sec)
LuceneVectorsOnly              14.17
LuceneHnsw                   8940.41
LuceneHnswNoForceMerge       3623.68 
hnswlib                       587.23
{code}
We force merge to one segment to emulate a case where vectors aren't 
continually being indexed. In these situations, it seems likely users would 
force merge to optimize search speed: searching a single large graph is 
expected to be faster than searching several small ones serially. To see how 
long the force merge takes, we can subtract LuceneHnswNoForceMerge from 
LuceneHnsw.

The construction parameters match those in LUCENE-9937 and are optimized for 
search recall + QPS instead of index speed, as I figured this would be a common 
set-up.

Some observations:
 * In cases when segments are eventually force merged, we do a lot of extra 
work building intermediate graphs that are eventually merged away. This is a 
difficult problem, and one that's been raised in the past. As a simple step, I 
wonder if we should not build graphs for segments that are below a certain 
size. For sufficiently small segments, it could be a better trade-off to avoid 
building a graph and support nearest-neighbor search through a brute-force scan?
 * Indexing is slow compared to what we're used to for other formats, even if 
we disregard the extra work mentioned above. For sift-128-euclidean, building 
only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 min.
 * As a note, graph indexing uses ANN searches in order to add each new vector 
to the graph. So the slower search speed between Lucene and hnswlib may 
contribute to slower indexing.


> ann-benchmarks results for HNSW indexing
> ----------------------------------------
>
>                 Key: LUCENE-9941
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9941
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> This is a continuation of LUCENE-9937, but for HNSW index performance.
> Approaches
>  * LuceneVectorsOnly: a baseline that only indexes vectors
>  * LuceneHnsw: our HNSW implementation, with a force merge to one segment
>  * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
>  * hnswlib: a C++ HNSW implementation from the author of the paper
> Datasets
>  * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, 
> comparing euclidean distance
>  * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, 
> comparing cosine similarity
> *Results on sift-128-euclidean*
>  Parameters: M=16, efConstruction=500
> {code:java}
> Approach                 Index time (sec)
> LuceneVectorsOnly              14.93
> LuceneHnsw                   3191.16
> LuceneHnswNoForceMerge       1194.31
> hnswlib                       311.09
> {code}
> *Results on glove-100-angular*
>  Parameters: M=32, efConstruction=500
> {code:java}
> Approach                  Index time (sec)
> LuceneVectorsOnly              14.17
> LuceneHnsw                   8940.41
> LuceneHnswNoForceMerge       3623.68 
> hnswlib                       587.23
> {code}
> We force merge to one segment to emulate a case where vectors aren't 
> continually being indexed. In these situations, it seems likely users would 
> force merge to optimize search speed: searching a single large graph is 
> expected to be faster than searching several small ones serially. To see how 
> long the force merge takes, we can subtract LuceneHnswNoForceMerge from 
> LuceneHnsw.
> The construction parameters match those in LUCENE-9937 and are optimized for 
> search recall + QPS instead of index speed, as I figured this would be a 
> common set-up.
> Some observations:
>  * In cases when segments are eventually force merged, we do a lot of extra 
> work building intermediate graphs that are eventually merged away. This is a 
> difficult problem, and one that's been raised in the past. As a simple step, 
> I wonder if we should not build graphs for segments that are below a certain 
> size. For sufficiently small segments, it could be a better trade-off to 
> avoid building a graph and support nearest-neighbor search through a 
> brute-force scan?
>  * Indexing is slow compared to what we're used to for other formats, even if 
> we disregard the extra work mentioned above. For sift-128-euclidean, building 
> only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 
> min.
>  * As a note, graph indexing uses ANN searches in order to add each new 
> vector to the graph. So the slower search speed between Lucene and hnswlib 
> may contribute to slower indexing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to