msokolov commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1963097347
HNSW stands for "hierarchical navigable small world" - that should make it
easy to remember :)
--
This is an automated message from the Apache Git Service.
To respond to the message,
uschindler commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1962440478
Is it called HNSW or HNWS? I just noticed the title of this PR and differing
changes entries.
benwtrent commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1961350051
So cool. We are now faster at 768 dimensions than we were on 100 dimensions.
⚡ ⚡ ⚡ ⚡ ⚡
mayya-sharipova merged PR #12962:
URL: https://github.com/apache/lucene/pull/12962
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lu
benwtrent commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1478196742
##
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:
##
@@ -24,15 +24,8 @@
import org.apache.lucene.index.LeafReaderCo
mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1924675805
@benwtrent @jimczi Thanks for your great feedback. With the latest changes I
tried to address your comments; please check whether anything else is needed.
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1476713535
##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -27,25 +29,78 @@
*/
public final class TopKnnCollector extends AbstractKnnColle
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1476532664
##
lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java:
##
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) und
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1476712873
##
lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java:
##
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) und
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1476712594
##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -277,27 +273,24 @@ public final TopDocs searchNearestVectors(
*
* @param field t
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1476712259
##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -236,27 +235,24 @@ public final PostingsEnum postings(Term term) throws
IOException {
jimczi commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1475878963
##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws
IOExcepti
tveasey commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1475846853
##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws
IOExcept
benwtrent commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1475060677
##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws
IOExce
benwtrent commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1475057976
##
lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java:
##
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1474964972
##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws
jimczi commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1474802862
##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -236,27 +235,24 @@ public final PostingsEnum postings(Term term) throws
IOException {
*
mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919701631
I've re-run the experiments with the latest changes on this PR (candidate) and the main
branch (baseline):
I have also done experiments using the Cohere dataset, as seen below:
- for
tveasey commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919650070
> @jimczi @tveasey I've addressed your comments. Are we ok to merge as it is
now?
I'm happy
benwtrent commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919489643
> but I think we need to run more experiments on smaller dims datasets as
well, how about we leave this for the follow up?
I am 100% fine with this. It was a crazy idea and it on
mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919480036
@benwtrent Thanks for running additional tests. Looks like running with
dynamic `k` can speed up searches, but I think we need to run more experiments
on smaller dims datasets as
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1473009083
##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -79,24 +82,30 @@ public Query rewrite(IndexSearcher indexSearcher) throws
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1473008106
##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -280,12 +289,20 @@ public final TopDocs searchNearestVectors(
* @param k the number
benwtrent commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1919297975
I fixed my data and ran with 1.5M cohere:
static_k is this PR
dynamic_k is this PR + scaling the `k` explored by
```
float v = (float)Math.log(sumVectorCount / (do
benwtrent commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1912157192
I ran my own experiment, which showed some interesting and frustrating
results.
I adjusted the indexing to randomly commit() on every 500 docs or so. I
indexed the first 10M do
benwtrent commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1466483832
##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -280,12 +289,20 @@ public final TopDocs searchNearestVectors(
* @param k the number of doc
jimczi commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1456027720
##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -240,11 +241,19 @@ public final PostingsEnum postings(Term term) throws
IOException {
* @par
tveasey commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1895920963
> This makes an interval of 255 a reasonable choice.
I agree. This looks better to me. One thing I would be intrigued to try is
the slight change in schedules as per
[this](https:
tveasey commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1455672480
##
lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java:
##
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or
mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1895612417
> It's also important to check the order of execution. For instance what
happens if all segments are executed serially (rather than in parallel), does
it change the recall?
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1454345777
##
lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java:
##
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under
mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1894402548
I have done more experiments with different `interval` values:
Cohere 768 dims:
1M vectors, k=10, fanout=90
| Interval | Avg visited nodes | QPS| Recall
tveasey commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1453341457
##
lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java:
##
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or
tveasey commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1453323532
##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -27,25 +29,72 @@
*/
public final class TopKnnCollector extends AbstractKnnCollector {
tveasey commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1453309333
##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -27,25 +29,72 @@
*/
public final class TopKnnCollector extends AbstractKnnCollector {
jimczi commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1453088407
##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -27,25 +29,72 @@
*/
public final class TopKnnCollector extends AbstractKnnCollector {
mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1452703248
##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -27,25 +29,72 @@
*/
public final class TopKnnCollector extends AbstractKnnColle
mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1892646998
@jimczi Thanks for your feedback.
> Some of the recalls for the single segment baseline seem quite off
(0.477?). Are you sure that there was no issue during the testin
jimczi commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1452096401
##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -240,11 +241,19 @@ public final PostingsEnum postings(Term term) throws
IOException {
* @par
mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1889934399
I have also done experiments using the Cohere dataset; as seen below for the 10M
docs dataset, the speedups with the proposed approach are 1.7-2.5x.
## Cohere/wikipedia-22
tveasey commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1873862125
IMO we shouldn't focus too much on recall since the greediness of
non-competitive search allows us to tune this. My main concern is whether
contention on the queue updates causes a slowdown.
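To make the contention concern concrete, here is a minimal sketch of one way to keep lock traffic low: each segment buffers its scores locally and publishes them to the shared structure only once per `interval` visited nodes (an interval of 255 is called a reasonable choice elsewhere in this thread). This is not the code from this PR. `BatchedScorePublisher`, `SharedSink`, and the bitmask check are hypothetical, and the sketch assumes `interval` has the form 2^n - 1.

```java
/** Hypothetical sketch: buffer scores per segment and publish them in batches. */
final class BatchedScorePublisher {

  /** Shared sink; each publish() call stands in for one lock acquisition. */
  static final class SharedSink {
    private int lockAcquisitions = 0;
    private float best = Float.NEGATIVE_INFINITY;

    synchronized void publish(float[] scores, int len) {
      lockAcquisitions++;
      for (int i = 0; i < len; i++) {
        best = Math.max(best, scores[i]);
      }
    }

    synchronized int lockAcquisitions() {
      return lockAcquisitions;
    }

    synchronized float best() {
      return best;
    }
  }

  private final SharedSink sink;
  private final int interval; // assumed to be 2^n - 1 so it can double as a bitmask
  private final float[] buffer;
  private int buffered = 0;
  private int visited = 0;

  BatchedScorePublisher(SharedSink sink, int interval) {
    if (interval <= 0 || (interval & (interval + 1)) != 0) {
      throw new IllegalArgumentException("interval must be of the form 2^n - 1");
    }
    this.sink = sink;
    this.interval = interval;
    this.buffer = new float[interval + 1];
  }

  /** Records one visited node; touches the shared sink once per (interval + 1) visits. */
  void visit(float score) {
    buffer[buffered++] = score;
    visited++;
    if ((visited & interval) == 0) {
      flush();
    }
  }

  /** Publishes any buffered scores in a single call to the shared sink. */
  void flush() {
    if (buffered > 0) {
      sink.publish(buffer, buffered);
      buffered = 0;
    }
  }
}
```

With `interval = 255`, 1024 visited nodes cost 4 synchronized calls instead of 1024. The trade-off is that other segments see a slightly stale view of the global scores between flushes.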
mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1866829988
### 1M vectors of 100 dims
k=10, fanout=90
| |Avg visited nodes | QPS| Recall|
| :--- |---
mayya-sharipova opened a new pull request, #12962:
URL: https://github.com/apache/lucene/pull/12962
A second implementation of #12794 using Queue instead of MaxScoreAccumulator.
Speed up concurrent multi-segment HNSW graph search by exchanging the global
top scores collected so far ac
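
The idea of exchanging global top scores between segment searches can be sketched roughly as follows. This is an illustration only, not the implementation in this PR: `SharedTopScores` is a hypothetical class, and it assumes that larger scores are better and that every segment search holds a reference to the same instance. A searcher offers each collected score and gets back the current global k-th best score, which it could use as a floor for pruning non-competitive candidates.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch: thread-safe min-heap of the k best scores seen by any segment. */
final class SharedTopScores {
  private final int k;
  // Min-heap: once k scores are held, the head is the global k-th best score.
  private final PriorityQueue<Float> heap = new PriorityQueue<>();

  SharedTopScores(int k) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive");
    }
    this.k = k;
  }

  /**
   * Offers a score and returns the current global k-th best score,
   * or negative infinity while fewer than k scores have been collected.
   */
  synchronized float offer(float score) {
    if (heap.size() < k) {
      heap.add(score);
    } else if (score > heap.peek()) {
      heap.poll();
      heap.add(score);
    }
    return heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek();
  }
}
```

A segment search that sees a returned floor higher than its own local k-th best score knows that some of its local candidates can no longer make the global top k, so it can stop exploring them earlier.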