Re: [PR] Make TaskExecutor cx public and use TaskExecutor for concurrent HNSW graph build [lucene]

2023-11-17 Thread via GitHub


shubhamvishu commented on code in PR #12799:
URL: https://github.com/apache/lucene/pull/12799#discussion_r1396959558


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswConcurrentMergeBuilder.java:
##
@@ -77,42 +75,17 @@ public OnHeapHnswGraph build(int maxOrd) throws IOException 
{
   HNSW_COMPONENT,
   "build graph from " + maxOrd + " vectors, with " + workers.length + 
" workers");
 }
-List> futures = new ArrayList<>();
+List> futures = new ArrayList<>();
 for (int i = 0; i < workers.length; i++) {
   int finalI = i;
   futures.add(
-  exec.submit(
-  () -> {
-try {
-  workers[finalI].run(maxOrd);
-} catch (IOException e) {
-  throw new RuntimeException(e);
-}
-  }));
-}
-Throwable exc = null;
-for (Future future : futures) {
-  try {
-future.get();
-  } catch (InterruptedException e) {
-var newException = new ThreadInterruptedException(e);
-if (exc == null) {
-  exc = newException;
-} else {
-  exc.addSuppressed(newException);
-}
-  } catch (ExecutionException e) {
-if (exc == null) {
-  exc = e.getCause();
-} else {
-  exc.addSuppressed(e.getCause());
-}
-  }
-}
-if (exc != null) {
-  // The error handling was copied from TaskExecutor. should we just use 
TaskExecutor instead?
-  throw IOUtils.rethrowAlways(exc);
+  () -> {
+workers[finalI].run(maxOrd);
+return null;
+  });
 }
+TaskExecutor taskExecutor = new TaskExecutor(exec);

Review Comment:
   I think with this we are only creating 1 task executor instance per knn 
field because the concurrent graph building happens in 
`HnswConcurrentMergeBuilder#build` so moving the task executor creation above 
in the hierarchy(to `ConcurrentHnswMerger`) shouldn't make any difference as we 
would still be creating 1 instance per knn field or maybe I'm wrong?
   
   I'm happy to change it if this is not the case or if you feel thats still 
better approach to pass the task executor from above instead of executor 
service?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make TaskExecutor cx public and use TaskExecutor for concurrent HNSW graph build [lucene]

2023-11-17 Thread via GitHub


shubhamvishu commented on code in PR #12799:
URL: https://github.com/apache/lucene/pull/12799#discussion_r1396959558


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswConcurrentMergeBuilder.java:
##
@@ -77,42 +75,17 @@ public OnHeapHnswGraph build(int maxOrd) throws IOException 
{
   HNSW_COMPONENT,
   "build graph from " + maxOrd + " vectors, with " + workers.length + 
" workers");
 }
-List> futures = new ArrayList<>();
+List> futures = new ArrayList<>();
 for (int i = 0; i < workers.length; i++) {
   int finalI = i;
   futures.add(
-  exec.submit(
-  () -> {
-try {
-  workers[finalI].run(maxOrd);
-} catch (IOException e) {
-  throw new RuntimeException(e);
-}
-  }));
-}
-Throwable exc = null;
-for (Future future : futures) {
-  try {
-future.get();
-  } catch (InterruptedException e) {
-var newException = new ThreadInterruptedException(e);
-if (exc == null) {
-  exc = newException;
-} else {
-  exc.addSuppressed(newException);
-}
-  } catch (ExecutionException e) {
-if (exc == null) {
-  exc = e.getCause();
-} else {
-  exc.addSuppressed(e.getCause());
-}
-  }
-}
-if (exc != null) {
-  // The error handling was copied from TaskExecutor. should we just use 
TaskExecutor instead?
-  throw IOUtils.rethrowAlways(exc);
+  () -> {
+workers[finalI].run(maxOrd);
+return null;
+  });
 }
+TaskExecutor taskExecutor = new TaskExecutor(exec);

Review Comment:
   I think with this we are only creating 1 task executor instance per knn 
field because the concurrent graph building happens in 
`HnswConcurrentMergeBuilder#build`(where we make use of concurrency) so moving 
the task executor creation above in the hierarchy(to `ConcurrentHnswMerger`) 
shouldn't make any difference as we would still be creating 1 instance per knn 
field or maybe I'm wrong?
   
   I'm happy to change it if this is not the case or if you feel thats still 
better approach to pass the task executor from above instead of executor 
service?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] LUCENE-9951: Add InfoStream to ReplicationService [lucene]

2023-11-17 Thread via GitHub


mikemccand commented on PR #124:
URL: https://github.com/apache/lucene/pull/124#issuecomment-1816528624

   OK thank you for bringing closure @ChristophKaser.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Dry up DirectReader implementations [lucene]

2023-11-17 Thread via GitHub


original-brownbear opened a new pull request, #12823:
URL: https://github.com/apache/lucene/pull/12823

   This can be written in a much drier way that shouldn't come at any 
performance cost as far as I can see.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Improve hash mixing in FST's double-barrel LRU hash [lucene]

2023-11-17 Thread via GitHub


mikemccand commented on PR #12716:
URL: https://github.com/apache/lucene/pull/12716#issuecomment-1816539511

   Thanks @shubhamvishu and @dweiss and @bruno-roustant.  Hashing is fun and 
hard :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-11-17 Thread via GitHub


Shibi-bala commented on PR #12626:
URL: https://github.com/apache/lucene/pull/12626#issuecomment-1816600417

   @uschindler hey, thanks for the approval! Read the contributing guidelines, 
but not entirely sure how to get permissions to merge this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-11-17 Thread via GitHub


uschindler commented on PR #12626:
URL: https://github.com/apache/lucene/pull/12626#issuecomment-1816614768

   You can't do it.
   
   Please add a Changes entry unter the 9.9 section, commit it to branch and I 
will merge and Backport your PR.
   
   I am just away from my computer at moment, sorry vor the delay.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]

2023-11-17 Thread via GitHub


cpoerschke merged PR #448:
URL: https://github.com/apache/lucene/pull/448


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Update OpenNLP to 1.9.4 [LUCENE-10241] [lucene]

2023-11-17 Thread via GitHub


cpoerschke closed issue #11277: Update OpenNLP to 1.9.4 [LUCENE-10241]
URL: https://github.com/apache/lucene/issues/11277


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Update OpenNLP to 1.9.4 [LUCENE-10241] [lucene]

2023-11-17 Thread via GitHub


cpoerschke commented on issue #11277:
URL: https://github.com/apache/lucene/issues/11277#issuecomment-1816701100

   #448 is the merged `main` branch pull request and 
https://github.com/apache/lucene/commit/b8094d49aaf5e5cb5182c0307e25eafa2d332dda
 is the `branch_9x` commit.
   
   Thanks @jzonthemtn!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Re-use information from graph traversal during exact search [lucene]

2023-11-17 Thread via GitHub


kaivalnp commented on PR #12820:
URL: https://github.com/apache/lucene/pull/12820#issuecomment-1816720340

   Thanks @jpountz! I realised something from your comment:
   
   My current implementation has a flaw, because it cannot handle the 
[`OrdinalTranslatedKnnCollector`](https://github.com/kaivalnp/lucene/blob/03624754eb3ccf9da114ff5fc358cc230466ea3f/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L232)
 correctly: The 
[`setVisited`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L253)
 call has the `BitSet visited` as packed ordinals, but the 
[`getVisited`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java#L91)
 call receives a `docId` (and not a `vectorId`) so we would need a reverse 
[`IntToIntFunction 
docIdToVectorOrdinal`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/util/hnsw/OrdinalTranslatedKnnCollector.java#L31)
 to map it back to an ordinal
   
   This is straightforward for `DenseOffHeapVectorValues` or 
`EmptyOffHeapVectorValues` (because there is a 1-1 mapping between a doc and 
ordinal) but becomes a problem for `SparseOffHeapVectorValues` which has the 
`vectorOrdinaltoDocId` implemented as a 
[`DirectMonotonicReader`](https://github.com/kaivalnp/lucene/blob/reuse-graph-search/lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java#L139)
 - which I think is docIds stored one after another - so getting the docId for 
an ordinal is as simple as a lookup at that offset. However, getting an inverse 
of this can become costly (binary search -> returning the index) as opposed to 
the current constant time lookup
   
   I wonder how costly it would be to maintain the set of visited docs at the 
`KnnCollector` like you mentioned (perhaps using a `SparseFixedBitSet`)? We 
[already create a 
`BitSet`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L121)
 of `maxDoc` length to hold the filtered docs..
   
   In the worst case, we would need another `BitSet` of the same length to 
store which docs are visited from graph search, then skip over those from 
`#exactSearch`. However, there may be a better opportunity here: since we want 
to go over docs that are "prefiltered but not visited", can we simply `#clear` 
the bits whenever we visit a node - we just need to find a way to do this 
cleanly?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-11-17 Thread via GitHub


MarcusSorealheis commented on PR #12626:
URL: https://github.com/apache/lucene/pull/12626#issuecomment-1816799469

   @Shibi-bala It's here: 
   
   
https://github.com/apache/lucene/blob/c228e4bb66ca73c8150d8eaebe2bb999bcc6c9b1/lucene/CHANGES.txt#L147
   
   You need to include your user and the GitHub user(s) that reviewed 
it/assisted as seen in the other entries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-11-17 Thread via GitHub


Shibi-bala commented on PR #12626:
URL: https://github.com/apache/lucene/pull/12626#issuecomment-1816818775

   Made the changes. Thanks @uschindler @MarcusSorealheis  @msfroh  😁 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix CheckIndex to detect major corruption with old (not the latest) commit point [lucene]

2023-11-17 Thread via GitHub


mikemccand merged PR #12530:
URL: https://github.com/apache/lucene/pull/12530


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix segmentInfos replace doesn't set userData [lucene]

2023-11-17 Thread via GitHub


uschindler merged PR #12626:
URL: https://github.com/apache/lucene/pull/12626


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] segmentInfos.replace() doesn't set userData [lucene]

2023-11-17 Thread via GitHub


uschindler closed issue #12637: segmentInfos.replace() doesn't set userData
URL: https://github.com/apache/lucene/issues/12637


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] CheckIndex cannot "fix" indexes that have individual segments with missing or corrupt .si files because sanity checks will fail trying to read the index initially. [LUCENE-6762] [lucene]

2023-11-17 Thread via GitHub


mikemccand commented on issue #7820:
URL: https://github.com/apache/lucene/issues/7820#issuecomment-1816857195

   I merged the first step in this issue -- detecting when this unique 
snowflake form of corruption strikes.
   
   Step 2 is to enable exorcism when there is an `_X.si` file missing, which is 
tricky because the exception thrown (while loading `SegmentInfos`) does not 
make it clear which segment (s) are broken.  @gokaai I think you had some ideas 
on how to do this?  If we improved the thrown exception to add a member 
recording which segment is broken, `CheckIndex` could catch that and know 
exactly which segment to exorcise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-17 Thread via GitHub


jpountz commented on PR #12782:
URL: https://github.com/apache/lucene/pull/12782#issuecomment-1817146436

   Thanks @easyice. I took some time to look into the benchmark and improve a 
few things, hopefully you don't mind. Here is the output of the benchmark on my 
machine now:
   
   ```
   Benchmark   (numBytesPerInt)  (size)   Mode  
Cnt   Score   Error   Units
   GroupVIntBenchmark.byteArrayReadGroupVInt  1  64  thrpt  
  5  24.483 ± 0.345  ops/us
   GroupVIntBenchmark.byteArrayReadGroupVInt  2  64  thrpt  
  5  23.346 ± 0.288  ops/us
   GroupVIntBenchmark.byteArrayReadGroupVInt  3  64  thrpt  
  5  16.318 ± 0.062  ops/us
   GroupVIntBenchmark.byteArrayReadGroupVInt  4  64  thrpt  
  5  24.748 ± 0.993  ops/us
   GroupVIntBenchmark.byteArrayReadVInt   1  64  thrpt  
  5  17.767 ± 0.081  ops/us
   GroupVIntBenchmark.byteArrayReadVInt   2  64  thrpt  
  5   7.256 ± 0.013  ops/us
   GroupVIntBenchmark.byteArrayReadVInt   3  64  thrpt  
  5   5.546 ± 0.449  ops/us
   GroupVIntBenchmark.byteArrayReadVInt   4  64  thrpt  
  5   4.475 ± 0.021  ops/us
   GroupVIntBenchmark.byteBufferReadGroupVInt 1  64  thrpt  
  5  21.812 ± 0.485  ops/us
   GroupVIntBenchmark.byteBufferReadGroupVInt 2  64  thrpt  
  5  20.623 ± 1.454  ops/us
   GroupVIntBenchmark.byteBufferReadGroupVInt 3  64  thrpt  
  5  13.601 ± 0.299  ops/us
   GroupVIntBenchmark.byteBufferReadGroupVInt 4  64  thrpt  
  5  22.649 ± 0.662  ops/us
   GroupVIntBenchmark.byteBufferReadVInt  1  64  thrpt  
  5  22.147 ± 0.083  ops/us
   GroupVIntBenchmark.byteBufferReadVInt  2  64  thrpt  
  5   8.072 ± 0.116  ops/us
   GroupVIntBenchmark.byteBufferReadVInt  3  64  thrpt  
  5   4.554 ± 0.394  ops/us
   GroupVIntBenchmark.byteBufferReadVInt  4  64  thrpt  
  5   4.145 ± 0.674  ops/us
   ```
   
   The benchmark used to read directly from the in-memory byte[] by calling 
`rewind()` , I changed that to force it to read directly from the directoly to 
make the comparison a bit more fair.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use group-varint encoding for the tail of postings [lucene]

2023-11-17 Thread via GitHub


jpountz commented on code in PR #12782:
URL: https://github.com/apache/lucene/pull/12782#discussion_r1391047570


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntWriter.java:
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene99;
+
+import java.io.IOException;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Encode integers using group-varint. It uses VInt to encode tail values that 
are not enough for a
+ * group
+ */
+class GroupVIntWriter {
+  private int[] buffer = new int[4];
+  private byte[] bytes = new byte[16];
+  private int byteOffset = 0;
+  private int bufferOffset = 0;
+  private final IndexOutput output;
+
+  public GroupVIntWriter(IndexOutput output) {
+this.output = output;
+  }
+
+  public void add(int v) throws IOException {
+buffer = ArrayUtil.grow(buffer, bufferOffset + 1);
+buffer[bufferOffset++] = v;
+  }
+
+  public void reset(int numValues) {

Review Comment:
   Let's remove the `numValues` parameter since it looks like we can't actually 
rely on it?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/GroupVIntReader.java:
##
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene99;
+
+import java.io.IOException;
+import org.apache.lucene.store.DataInput;
+
+/**
+ * Decode integers using group-varint. It will fully read the bytes for the 
block, to avoid repeated
+ * expensive bounds checking per readBytes.
+ */
+public class GroupVIntReader {
+  DataInput in;
+
+  public GroupVIntReader() {}
+
+  /** Called when decode a new block. */
+  public void reset(DataInput indexInput) throws IOException {
+this.in = indexInput;
+  }
+
+  /** only readValues or nextInt can be called after reset */
+  public void readValues(long[] docs, int limit) throws IOException {

Review Comment:
   I'm not sure we need both a reset() and readValues(), maybe `readValues()` 
could take a `DataInput` directly?



##
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/GroupVIntBenchmark.java:
##
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.benchmark.jmh;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+import org.apache.lucene.codecs.lucene99.GroupVIntReader;
+import org.apache.lucene.codecs.lucene99.GroupVIntWriter;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.l

Re: [PR] Log number of visited nodes in knn query [lucene]

2023-11-17 Thread via GitHub


jpountz commented on PR #12819:
URL: https://github.com/apache/lucene/pull/12819#issuecomment-1817157857

   Logging doesn't sound like a good fit for this, would it be better exposed 
e.g. via the profiling query?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Adding HumanReadableQuery with a descrition param, used for debugging print output [lucene]

2023-11-17 Thread via GitHub


jpountz commented on code in PR #12816:
URL: https://github.com/apache/lucene/pull/12816#discussion_r1397929325


##
lucene/misc/src/java/org/apache/lucene/misc/search/HumanReadableQuery.java:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.search;
+
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Weight;
+
+/**
+ * A simple query wrapper for debug purposes. Behaves like the given query, 
but when printing to a
+ * string, it will prepend the description parameter to the query output.
+ */
+public final class HumanReadableQuery extends Query {
+
+  private final Query in;
+  private final String description;
+
+  /**
+   * Create a new HumanReadableQuery
+   *
+   * @param in the query to wrap
+   * @param description a human-readable description, used in toString()
+   */
+  public HumanReadableQuery(Query in, String description) {
+this.in = in;
+this.description = description;
+  }
+
+  /**
+   * @return the wrapped Query
+   */
+  public Query getWrappedQuery() {
+return in;
+  }
+
+  /**
+   * @return the query description
+   */
+  public String getDescription() {
+return description;
+  }
+
+  @Override
+  public Query rewrite(IndexSearcher indexSearcher) {
+return in;
+  }
+
+  @Override
+  public String toString(String field) {
+return this.getDescription() + ":" + in.toString(field);
+  }
+
+  @Override
+  public void visit(QueryVisitor visitor) {
+in.visit(visitor);
+  }
+
+  @Override
+  public boolean equals(Object other) {
+return sameClassAs(other) && in.equals(((HumanReadableQuery) other).in);
+  }
+
+  @Override
+  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, 
float boost) {
+throw new UnsupportedOperationException("HumanReadableQuery does not 
support #createWeight()");
+  }
+
+  @Override
+  public int hashCode() {
+return in.hashCode();

Review Comment:
   Nit: Ideally a query and the same query wrapped in a `HumanReadableQuery` 
would have different hashcodes to avoid collisions.
   
   ```suggestion
   return 31 * classHash() + in.hashCode();
   ```



##
lucene/misc/src/test/org/apache/lucene/misc/search/TestHumanReadableQuery.java:
##
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.search;
+
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.BooleanClause;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.KnnFloatVectorQuery;
+import org.apache.lucene.search.PhraseQuery;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.analysis.MockAnalyzer;
+import org.apache.lucene.tests.analysis.MockTokenizer;
+import org.apache.lucene.tests.index.RandomIndexWriter;
+import org.apache.lucene.tests.search.CheckHits;
+import org.apache.lucene.tests.util.English;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.TestUtil;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+
+public 

Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-17 Thread via GitHub


vigyasharma commented on code in PR #12794:
URL: https://github.com/apache/lucene/pull/12794#discussion_r1397994430


##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;
 
   private final NeighborQueue queue;
+  private final float greediness;
+  private final NeighborQueue queueg;
+  private final MaxScoreAccumulator globalMinSimAcc;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
+
+  // greediness of globally non-competitive search: [0,1]
 
   /**
* @param k the number of neighbors to collect
* @param visitLimit how many vector nodes the results are allowed to visit
+   * @param globalMinSimAcc the global minimum competitive similarity tracked 
across all segments
*/
-  public TopKnnCollector(int k, int visitLimit) {
+  public TopKnnCollector(int k, int visitLimit, MaxScoreAccumulator 
globalMinSimAcc) {
+super(k, visitLimit);
+this.greediness = DEFAULT_GREEDINESS;
+this.queue = new NeighborQueue(k, false);
+int queuegSize = Math.max(1, Math.round((1 - greediness) * k));
+this.queueg = new NeighborQueue(queuegSize, false);
+this.globalMinSimAcc = globalMinSimAcc;
+  }
+
+  public TopKnnCollector(
+  int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc, float 
greediness) {
 super(k, visitLimit);
+this.greediness = greediness;
 this.queue = new NeighborQueue(k, false);
+this.queueg = new NeighborQueue(Math.round((1 - greediness) * k), false);
+this.globalMinSimAcc = globalMinSimAcc;
   }
 
   @Override
   public boolean collect(int docId, float similarity) {
-return queue.insertWithOverflow(docId, similarity);
+boolean result = queue.insertWithOverflow(docId, similarity);
+queueg.insertWithOverflow(docId, similarity);
+
+boolean reachedKResults = (kResultsCollected == false && queue.size() == 
k());
+if (reachedKResults) {
+  kResultsCollected = true;
+}
+if (globalMinSimAcc != null && kResultsCollected) {
+  // as we've collected k results, we can start exchanging globally
+  globalMinSimAcc.accumulate(queue.topNode(), queue.topScore());
+
+  // periodically update the local copy of global similarity
+  if (reachedKResults || (visitedCount & globalMinSimAcc.modInterval) == 
0) {
+MaxScoreAccumulator.DocAndScore docAndScore = globalMinSimAcc.get();
+cachedGlobalMinSim = docAndScore.score;

Review Comment:
   We only peek at the global minimum similarity after we have accumulated `k` 
local results. For my understanding, what would happen if we were to start 
using `globalMinSim` as soon as any leaf has collected its top K results?
   
   I'm curious how the graph structure and entry point calculation affects 
this. Looking at `graphSearcher.findBestEntryPoint(...)`, i think we navigate 
to the closest node we can find to our query on layer-1 and use that as the 
starting point for layer-0. Now layer-0 is a lot more dense than layer-1, and 
our best candidates are likely in the space between the entry-point on layer-1 
and all its neighbors. 
   
   Is it always a good idea to at least look at k nodes around the entry point, 
regardless of the global min. competitive score?



##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;

Review Comment:
   So `greediness` is essentially how "greedy" our algorithm for picking top 
matches gets to be. 
   At 1, we go with the global min competitive similarity. And by default, we 
continue to search the segment as long as we find neighbors that are better 
than the top 10% of our local picks from the segment `(1-0.9)*k)`.
   
   Should we add some documentation around this param? My initial, naive 
impression was that higher greediness might mean we pick more nodes per 
segment, which is quite the opposite. :) 



##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -26,26 +26,71 @@
  * @lucene.experimental
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  private static final float DEFAULT_GREEDINESS = 0.9f;
 
   private final NeighborQueue queue;
+  private final float greediness;
+  private final NeighborQueue queueg;
+  private final MaxScoreAccumulator globalMinSimAcc;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
+
+  // greediness of globally non-competitive search: [0,1]
 
   /**
* @param k the number of neighbors to collect
* @param visitLimit how many vector nodes the results are allowed to visit
+   * @param globalMi

Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-17 Thread via GitHub


vigyasharma commented on PR #12794:
URL: https://github.com/apache/lucene/pull/12794#issuecomment-1817274998

   We seem to consistently see an improvement in recall between single segment, 
and multi-segment runs (both seq and conc.) on baseline. Is this because with 
multiple segments, we get multiple entry points into the overall graph? Whereas 
in a single merged segment, we only have access to a sparser set of nodes in 
layer-1 while finding the single best entry point?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-17 Thread via GitHub


vigyasharma commented on PR #12794:
URL: https://github.com/apache/lucene/pull/12794#issuecomment-1817282807

   Do you have a mental model on what kind of graphs would see minimal loss of 
recall between baseline and candidate? Is this change better with denser 
(higher fanout) graphs? Would it depend on graph construction params like 
`beam-width` ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Adding HumanReadableQuery with a descrition param, used for debugging print output [lucene]

2023-11-17 Thread via GitHub


slow-J commented on code in PR #12816:
URL: https://github.com/apache/lucene/pull/12816#discussion_r1398045523


##
lucene/misc/src/test/org/apache/lucene/misc/search/TestHumanReadableQuery.java:
##
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.misc.search;
+
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.BooleanClause;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.KnnFloatVectorQuery;
+import org.apache.lucene.search.PhraseQuery;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.analysis.MockAnalyzer;
+import org.apache.lucene.tests.analysis.MockTokenizer;
+import org.apache.lucene.tests.index.RandomIndexWriter;
+import org.apache.lucene.tests.search.CheckHits;
+import org.apache.lucene.tests.util.English;
+import org.apache.lucene.tests.util.LuceneTestCase;
+import org.apache.lucene.tests.util.TestUtil;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+
+public class TestHumanReadableQuery extends LuceneTestCase {
+  private static IndexSearcher searcher;
+  private static IndexReader reader;
+  private static Directory directory;
+  private static final String field = "field";
+
+  @BeforeClass
+  public static void beforeClass() throws Exception {
+directory = newDirectory();
+RandomIndexWriter writer =
+new RandomIndexWriter(
+random(),
+directory,
+newIndexWriterConfig(new MockAnalyzer(random(), 
MockTokenizer.SIMPLE, true))
+.setMaxBufferedDocs(TestUtil.nextInt(random(), 100, 1000))
+.setMergePolicy(newLogMergePolicy()));
+// writer.infoStream = System.out;
+for (int i = 0; i < 1000; i++) {
+  Document doc = new Document();
+  doc.add(newTextField(field, English.intToEnglish(i), Field.Store.YES));
+  writer.addDocument(doc);
+}
+reader = writer.getReader();
+searcher = newSearcher(reader);
+writer.close();
+  }
+
+  @AfterClass
+  public static void afterClass() throws Exception {
+reader.close();
+directory.close();
+searcher = null;
+reader = null;
+directory = null;
+  }
+
+  public void testHitsEqualPhraseQuery() throws Exception {
+PhraseQuery pQuery = new PhraseQuery(field, "seventy", "seven");
+CheckHits.checkHits(
+random(),
+pQuery,
+field,
+searcher,
+new int[] {77, 177, 277, 377, 477, 577, 677, 777, 877, 977});
+
+String description = "TestPhraseQuery";
+HumanReadableQuery hQuery = new HumanReadableQuery(pQuery, description);
+CheckHits.checkHits(
+random(),
+hQuery,
+field,
+searcher,
+new int[] {77, 177, 277, 377, 477, 577, 677, 777, 877, 977});
+assertEquals(hQuery.toString(), description + ":" + pQuery);

Review Comment:
   Thanks, switching the params now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Adding HumanReadableQuery with a descrition param, used for debugging print output [lucene]

2023-11-17 Thread via GitHub


slow-J commented on PR #12816:
URL: https://github.com/apache/lucene/pull/12816#issuecomment-1817340280

   > I left minor comments but it looks good to me otherwise!
   
   Thanks for the feedback! Done the changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix Field.java documentation to refer to new IntField/FloatField/LongField/DoubleField #12125 [lucene]

2023-11-17 Thread via GitHub


jpountz commented on PR #12821:
URL: https://github.com/apache/lucene/pull/12821#issuecomment-1817419020

   Thanks for doing it, it looks like the PR includes unintended changes though?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove delayed seek optimization. [lucene]

2023-11-17 Thread via GitHub


jpountz merged PR #12815:
URL: https://github.com/apache/lucene/pull/12815


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Adding HumanReadableQuery with a descrition param, used for debugging print output [lucene]

2023-11-17 Thread via GitHub


jpountz merged PR #12816:
URL: https://github.com/apache/lucene/pull/12816


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can/should `KnnByte/FloatVectorQuery` carry some human-meaningful opaque `toString` fragment? [lucene]

2023-11-17 Thread via GitHub


jpountz closed issue #12487: Can/should `KnnByte/FloatVectorQuery` carry some 
human-meaningful opaque `toString` fragment?
URL: https://github.com/apache/lucene/issues/12487


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org