Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-17 Thread via GitHub


mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1895612417

   > It's also important to check the order of execution. For instance what 
happens if all segments are executed serially (rather than in parallel), does 
it changes the recall?
   
   @jimczi  The results are below
   
   
   ---
   Sequential run cohere: 768 dims; interval: 255
   
   ### 1M vectors 
   k=10, fanout=90
   
   | |Avg visited nodes |   QPS|   Recall| 
   |  :---   |---:  | ---: |---: |  
   | Baseline Single segment |  2033|  1212|0.880|  
 
   | Baseline 8 segments sequential  | 13927|   175|0.974|  
 
   | Candidate_with_queue sequential |  9407|   269|0.923|
   
   k=100, fanout=900
   | |Avg visited nodes |   QPS|   Recall| 
   |  :---   |---:  | ---: |---: |  
   | Baseline Single segment | 13083|   167|0.964|  
 
   | Baseline 8 segments sequential  | 81824|28|0.997|  
 
   | Candidate_with_queue sequential | 36851|62|0.983|
   
   ### 10M vectors 
   k=10, fanout=90
   
   |  |Avg visited nodes |   QPS|   Recall| 
   |  :---|---:  | ---: |---: | 
 
   | Baseline Single segment  |  2189|   984|0.929| 
  
   | Baseline 19 segments sequential  | 37656|59|0.951| 
  
   | Candidate_with_queue sequential  | 17748|   121|0.884|
   
   
   k=100, fanout=900
   | |Avg visited nodes |   QPS|   Recall| 
   |  :---   |---:  | ---: |---: |  
   | Baseline Single segment | 15040|   122|0.950|  
 
   | Baseline 19 segments sequential |229945| 9|0.990|  
 
   | Candidate_with_queue sequential | 77910|27|0.964|
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix BWC test generation after mondernizing LineDocFile [lucene]

2024-01-17 Thread via GitHub


jpountz commented on code in PR #13021:
URL: https://github.com/apache/lucene/pull/13021#discussion_r1455491681


##
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java:
##
@@ -219,22 +228,48 @@ public void testCreateMoreTermsIndex() throws Exception {
 new 
IndexWriterConfig(analyzer).setMergePolicy(mp).setUseCompoundFile(false);
 IndexWriter writer = new IndexWriter(dir, conf);
 LineFileDocs docs = new LineFileDocs(new Random(0));
+Field docIdDV = null;
+Field titleDVField = null;
 for (int i = 0; i < 50; i++) {
-  writer.addDocument(docs.nextDoc());
+  Document doc = docs.nextDoc();
+  if (docIdDV == null) {
+docIdDV = new NumericDocValuesField("docid_intDV", 0);
+doc.add(docIdDV);
+  }

Review Comment:
   Nit: this seems to assume that the `Document` instance that is returned is 
always the same, which feels a bit fragile. Maybe do a shallow copy of the 
document all the time? Or use `TestUtil.cloneDocument()`?



##
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java:
##
@@ -204,11 +204,20 @@ private Path getIndexDir() {
   }
 
   public void testCreateMoreTermsIndex() throws Exception {
-
 Path indexDir = getIndexDir().resolve("moreterms");
 Files.deleteIfExists(indexDir);
-Directory dir = newFSDirectory(indexDir);
+try (Directory dir = newFSDirectory(indexDir)) {
+  createMoreTermsIndex(dir);
+}
+  }
+
+  public void testCreateMoreTermsIndexInternal() throws Exception {
+try (Directory dir = newDirectory()) {
+  createMoreTermsIndex(dir);
+}
+  }
 
+  public void createMoreTermsIndex(Directory dir) throws Exception {

Review Comment:
   nit: it doesn't seem to need to be public?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix BWC test generation after mondernizing LineDocFile [lucene]

2024-01-17 Thread via GitHub


s1monw commented on code in PR #13021:
URL: https://github.com/apache/lucene/pull/13021#discussion_r1455526064


##
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java:
##
@@ -219,22 +228,48 @@ public void testCreateMoreTermsIndex() throws Exception {
 new 
IndexWriterConfig(analyzer).setMergePolicy(mp).setUseCompoundFile(false);
 IndexWriter writer = new IndexWriter(dir, conf);
 LineFileDocs docs = new LineFileDocs(new Random(0));
+Field docIdDV = null;
+Field titleDVField = null;
 for (int i = 0; i < 50; i++) {
-  writer.addDocument(docs.nextDoc());
+  Document doc = docs.nextDoc();
+  if (docIdDV == null) {
+docIdDV = new NumericDocValuesField("docid_intDV", 0);
+doc.add(docIdDV);
+  }

Review Comment:
   yeah, so we are doing this in the other tests too bef0r my change. That's 
why i did it that way... I can clean this all up in a followup



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix BWC test generation after mondernizing LineDocFile [lucene]

2024-01-17 Thread via GitHub


jpountz commented on code in PR #13021:
URL: https://github.com/apache/lucene/pull/13021#discussion_r1455533959


##
lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java:
##
@@ -219,22 +228,48 @@ public void testCreateMoreTermsIndex() throws Exception {
 new 
IndexWriterConfig(analyzer).setMergePolicy(mp).setUseCompoundFile(false);
 IndexWriter writer = new IndexWriter(dir, conf);
 LineFileDocs docs = new LineFileDocs(new Random(0));
+Field docIdDV = null;
+Field titleDVField = null;
 for (int i = 0; i < 50; i++) {
-  writer.addDocument(docs.nextDoc());
+  Document doc = docs.nextDoc();
+  if (docIdDV == null) {
+docIdDV = new NumericDocValuesField("docid_intDV", 0);
+doc.add(docIdDV);
+  }

Review Comment:
   Sounds good!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix BWC test generation after mondernizing LineDocFile [lucene]

2024-01-17 Thread via GitHub


s1monw merged PR #13021:
URL: https://github.com/apache/lucene/pull/13021


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-17 Thread via GitHub


tveasey commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1455672480


##
lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java:
##
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.util.concurrent.locks.ReentrantLock;
+
+/**
+ * A blocking bounded min heap that stores floats. The top element is the 
lowest value of the heap.
+ *
+ * A primitive priority queue that maintains a partial ordering of its 
elements such that the
+ * least element can always be found in constant time. Implementation is based 
on {@link
+ * org.apache.lucene.util.LongHeap}
+ *
+ * @lucene.internal
+ */
+public final class BlockingFloatHeap {
+  private final int maxSize;
+  private final float[] heap;
+  private final ReentrantLock lock;
+  private int size;
+
+  public BlockingFloatHeap(int maxSize) {
+this.maxSize = maxSize;
+this.heap = new float[maxSize + 1];
+this.lock = new ReentrantLock();
+this.size = 0;
+  }
+
+  /**
+   * Inserts a value into this heap.
+   *
+   * If the number of values would exceed the heap's maxSize, the least 
value is discarded
+   *
+   * @param value the value to add
+   * @return the new 'top' element in the queue.
+   */
+  public float offer(float value) {
+lock.lock();
+try {
+  if (size < maxSize) {
+push(value);
+return heap[1];
+  } else {
+if (value >= heap[1]) {
+  updateTop(value);
+}
+return heap[1];
+  }
+} finally {
+  lock.unlock();
+}
+  }
+
+  /**
+   * Inserts array of values into this heap.
+   *
+   * Values are expected to be sorted in ascending order.
+   *
+   * @param values a set of values to insert
+   * @return the new 'top' element in the queue.
+   */
+  public float offer(float[] values) {
+lock.lock();
+try {
+  for (int i = values.length - 1; i >= 0; i--) {
+if (size < maxSize) {
+  push(values[i]);
+} else {
+  if (values[i] >= heap[1]) {

Review Comment:
   But I mean we don't need to keep iterating the loop, i.e. we keep iterating 
all the way down to 0 but know we will never push anything, instead we can just 
exit immediately. This is all done under a lock so it seems best to avoid this 
work, even if it is just the fast path of the loop.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-17 Thread via GitHub


tveasey commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1895920963

   > This makes an interval of 255 a reasonable choice.
   
   I agree. This looks better to me. One thing I would be intrigued to try is 
the slight change in schedules as per 
[this](https://github.com/apache/lucene/pull/12962#discussion_r1453323532). 
Particularly, what happens if we delay using the information in 
`minCompetitiveSimilarity`. However, these results are already very good and we 
could push them out to a follow on PR.
   
   One last thing I think we should consider is exploring the variance we get 
in recall as a result of this change. Specifically, if we were to run with some 
random waits in the different segment searches what is the variation in the 
recalls we see?
   
   The danger is we get unlikely in ordering of searches and prune segment 
searches which contain the true nearest neighbours more aggressively sometimes. 
On average this shouldn't happen, but if we also see low variation in recall 
for the 1M test case in such a test case it would reassuring.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Cut over HNSW's neighbor lists to group-varint? [lucene]

2024-01-17 Thread via GitHub


easyice commented on issue #12871:
URL: https://github.com/apache/lucene/issues/12871#issuecomment-1896075935

   Hi Adrien, i tried the idea, this will got a ~10% speedup on JHM output, 
which is slightly less because it needs  an extra loop to decode the delta.
   
   The benchmark using neighbor list = 32 (number of vints will be read), JMH 
output on java21:
   
   ```
   Benchmark  (baseline)   Mode  Cnt  Score   Error   Units
   NeighborListsBenchMark.doSeektrue  thrpt5  0.026 ± 0.002  ops/us
   NeighborListsBenchMark.doSeek   false  thrpt5  0.029 ± 0.001  ops/us
   ```
   
   The POC code is 
[here](https://github.com/easyice/lucene/commit/81954223abcf787e85ac6c4a1184e922db29a3cc)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Enable CheckIndex to exorcise segments with missing segment infos (.si) (#7820) [lucene]

2024-01-17 Thread via GitHub


github-actions[bot] commented on PR #12872:
URL: https://github.com/apache/lucene/pull/12872#issuecomment-1897533423

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org