Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-25 Thread via GitHub
msokolov commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1963097347 HNSW stands for "hierarchical navigable small world" - that should make it easy to remember :) -- This is an automated message from the Apache Git Service. To respond to the message,

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-24 Thread via GitHub
uschindler commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1962440478 Is it called HNSW or HNWS? I just noticed the title of this PR and differing changes entries.

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-23 Thread via GitHub
benwtrent commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1961350051 So cool. We are now faster at 768 dimensions than we were on 100 dimensions. ⚡ ⚡ ⚡ ⚡ ⚡

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-06 Thread via GitHub
mayya-sharipova merged PR #12962: URL: https://github.com/apache/lucene/pull/12962

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-05 Thread via GitHub
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1478196742 ## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java: ## @@ -24,15 +24,8 @@ import org.apache.lucene.index.LeafReaderCo

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-02 Thread via GitHub
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1924675805 @benwtrent @jimczi Thanks for your great feedback. With the latest changes I tried to address your comments; please check whether anything else is needed.

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-02 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1476713535 ## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ## @@ -27,25 +29,78 @@ */ public final class TopKnnCollector extends AbstractKnnColle

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-02 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1476532664 ## lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java: ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-02 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1476712873 ## lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java: ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) und

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-02 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1476712594 ## lucene/core/src/java/org/apache/lucene/index/LeafReader.java: ## @@ -277,27 +273,24 @@ public final TopDocs searchNearestVectors( * * @param field t

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-02 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1476712259 ## lucene/core/src/java/org/apache/lucene/index/LeafReader.java: ## @@ -236,27 +235,24 @@ public final PostingsEnum postings(Term term) throws IOException {

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-02 Thread via GitHub
jimczi commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1475878963 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOExcepti

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-02 Thread via GitHub
tveasey commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1475846853 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOExcept

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-01 Thread via GitHub
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1475060677 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOExce

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-01 Thread via GitHub
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1475057976 ## lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java: ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-01 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1474964972 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-01 Thread via GitHub
jimczi commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1474802862 ## lucene/core/src/java/org/apache/lucene/index/LeafReader.java: ## @@ -236,27 +235,24 @@ public final PostingsEnum postings(Term term) throws IOException { *

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-31 Thread via GitHub
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919701631 I've re-run the experiments with the latest changes on this PR (candidate) and the main branch (baseline). I have also done experiments using the Cohere dataset, as seen below: - for

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-31 Thread via GitHub
tveasey commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919650070 > @jimczi @tveasey I've addressed your comments. Are we ok to merge as it is now? I'm happy.

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-31 Thread via GitHub
benwtrent commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919489643 > but I think we need to run more experiments on smaller dims datasets as well, how about we leave this for the follow up? I am 100% fine with this. It was a crazy idea and it on

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-31 Thread via GitHub
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919480036 @benwtrent Thanks for running additional tests. Looks like running with dynamic `k` can speed up searches, but I think we need to run more experiments on smaller dims datasets as

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-31 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1473009083 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -79,24 +82,30 @@ public Query rewrite(IndexSearcher indexSearcher) throws

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-31 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1473008106 ## lucene/core/src/java/org/apache/lucene/index/LeafReader.java: ## @@ -280,12 +289,20 @@ public final TopDocs searchNearestVectors( * @param k the number

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-31 Thread via GitHub
benwtrent commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919297975 I fixed my data and ran with 1.5M cohere: static_k is this PR; dynamic_k is this PR + scaling the `k` explored by ``` float v = (float)Math.log(sumVectorCount / (do

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-26 Thread via GitHub
benwtrent commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1912157192 I ran my own experiment, which showed some interesting and frustrating results. I adjusted the indexing to commit() at random, roughly every 500 docs. I indexed the first 10M do

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-25 Thread via GitHub
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1466483832 ## lucene/core/src/java/org/apache/lucene/index/LeafReader.java: ## @@ -280,12 +289,20 @@ public final TopDocs searchNearestVectors( * @param k the number of doc

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-18 Thread via GitHub
jimczi commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1456027720 ## lucene/core/src/java/org/apache/lucene/index/LeafReader.java: ## @@ -240,11 +241,19 @@ public final PostingsEnum postings(Term term) throws IOException { * @par

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-17 Thread via GitHub
tveasey commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1895920963 > This makes an interval of 255 a reasonable choice. I agree. This looks better to me. One thing I would be intrigued to try is the slight change in schedules as per [this](https:

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-17 Thread via GitHub
tveasey commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1455672480 ## lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java: ## @@ -0,0 +1,190 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-17 Thread via GitHub
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1895612417 > It's also important to check the order of execution. For instance, what happens if all segments are executed serially (rather than in parallel)? Does it change the recall?

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-16 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1454345777 ## lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java: ## @@ -0,0 +1,190 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-16 Thread via GitHub
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1894402548 I have done more experiments with different `interval` values: Cohere 768 dims: 1M vectors, k=10, fanout=90 | Interval | Avg visited nodes | QPS| Recall

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-16 Thread via GitHub
tveasey commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1453341457 ## lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java: ## @@ -0,0 +1,190 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-16 Thread via GitHub
tveasey commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1453323532 ## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ## @@ -27,25 +29,72 @@ */ public final class TopKnnCollector extends AbstractKnnCollector {

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-16 Thread via GitHub
tveasey commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1453309333 ## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ## @@ -27,25 +29,72 @@ */ public final class TopKnnCollector extends AbstractKnnCollector {

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-16 Thread via GitHub
jimczi commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1453088407 ## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ## @@ -27,25 +29,72 @@ */ public final class TopKnnCollector extends AbstractKnnCollector {

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-15 Thread via GitHub
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1452703248 ## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ## @@ -27,25 +29,72 @@ */ public final class TopKnnCollector extends AbstractKnnColle

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-15 Thread via GitHub
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1892646998 @jimczi Thanks for your feedback. > Some of the recalls for the single segment baseline seem quite off (0.477?). Are you sure that there was no issue during the testin

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-15 Thread via GitHub
jimczi commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1452096401 ## lucene/core/src/java/org/apache/lucene/index/LeafReader.java: ## @@ -240,11 +241,19 @@ public final PostingsEnum postings(Term term) throws IOException { * @par

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-12 Thread via GitHub
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1889934399 I have also done experiments using the Cohere dataset. As seen below, for the 10M-doc dataset the speedups with the proposed approach are 1.7-2.5x. ## Cohere/wikipedia-22

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-02 Thread via GitHub
tveasey commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1873862125 IMO we shouldn't focus too much on recall since the greediness of non-competitive search allows us to tune this. My main concern is whether contention on the queue updates causes a slowdown.

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2023-12-21 Thread via GitHub
mayya-sharipova commented on PR #12962: URL: https://github.com/apache/lucene/pull/12962#issuecomment-1866829988 ### 1M vectors of 100 dims k=10, fanout=90 | |Avg visited nodes | QPS| Recall| | :--- |---

[PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2023-12-21 Thread via GitHub
mayya-sharipova opened a new pull request, #12962: URL: https://github.com/apache/lucene/pull/12962 A second implementation of #12794 using Queue instead of MaxScoreAccumulator. Speedup concurrent multi-segment HNWS graph search by exchanging the global top scores collected so far ac
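
The idea summarized in the PR description above (segments searched concurrently while sharing the best scores found so far) can be sketched roughly as follows. This is an illustrative sketch only, not the code from the PR: `SharedTopScores`, `SegmentCollector`, and all of their methods are hypothetical names, and the sync `interval` parameter is an assumption modeled on the interval values discussed in the thread. For simplicity the sketch pushes every locally competitive score to the shared heap, which a real implementation would likely avoid.

```java
import java.util.PriorityQueue;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not Lucene's implementation.
// A thread-safe bounded min-heap holding the k best scores seen across all segments.
final class SharedTopScores {
  private final int k;
  // Min-heap: once it holds k entries, the head is the global k-th best score.
  private final PriorityQueue<Float> heap = new PriorityQueue<>();
  private final ReentrantLock lock = new ReentrantLock();

  SharedTopScores(int k) {
    if (k <= 0) throw new IllegalArgumentException("k must be positive");
    this.k = k;
  }

  // Offers the first len scores and returns the global k-th best score,
  // or -Infinity while fewer than k scores have been collected.
  float offer(float[] scores, int len) {
    lock.lock();
    try {
      for (int i = 0; i < len; i++) {
        float s = scores[i];
        if (heap.size() < k) {
          heap.add(s);
        } else if (s > heap.peek()) {
          heap.poll();
          heap.add(s);
        }
      }
      return heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek();
    } finally {
      lock.unlock();
    }
  }
}

// Hypothetical per-segment collector: buffers locally competitive scores and
// exchanges them with the shared heap once every `interval` visited nodes.
final class SegmentCollector {
  private final SharedTopScores global;
  private final int interval;
  private final float[] pending;
  private int pendingCount;
  private int visited;
  private float minCompetitive = Float.NEGATIVE_INFINITY;

  SegmentCollector(SharedTopScores global, int interval) {
    if (interval <= 0) throw new IllegalArgumentException("interval must be positive");
    this.global = global;
    this.interval = interval;
    this.pending = new float[interval]; // at most `interval` scores between syncs
  }

  // Returns true if the score beats the best known global threshold.
  boolean collect(float score) {
    visited++;
    boolean competitive = score > minCompetitive;
    if (competitive) pending[pendingCount++] = score;
    if (visited % interval == 0) sync();
    return competitive;
  }

  // Publishes buffered scores and raises the local threshold to the global one.
  void sync() {
    minCompetitive = Math.max(minCompetitive, global.offer(pending, pendingCount));
    pendingCount = 0;
  }

  float minCompetitiveScore() {
    return minCompetitive;
  }
}
```

In this sketch a larger `interval` means less lock contention on the shared heap but a staler threshold in each segment, which is the trade-off the interval experiments in the thread appear to explore.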