Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub


tteofili commented on PR #13984:
URL: https://github.com/apache/lucene/pull/13984#issuecomment-2469922060

   this might also check for graph connectivity, see #12627 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub


jfboeuf opened a new pull request, #13989:
URL: https://github.com/apache/lucene/pull/13989

   Take advantage of the existing buffer in `BufferedChecksum` to speed up 
reads for Longs, Ints, Shorts, and Long arrays by avoiding byte-by-byte reads.
   Use the faster `readLongs()` method to decode live docs and bloom filters.
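   The buffering idea can be sketched as follows (a minimal, hypothetical sketch with made-up names; the real `BufferedChecksum` uses `BitUtil` VarHandles rather than manual shifts):

   ```java
   import java.util.zip.CRC32;

   // Hypothetical, simplified sketch of the buffering idea: instead of feeding
   // each value to the checksum byte by byte, stage bytes in a scratch buffer
   // and hand them to the CRC in bulk. Class and method names are illustrative.
   class BufferedCrcSketch {
     private final CRC32 crc = new CRC32();
     private final byte[] buffer = new byte[256];
     private int upto;

     void updateLong(long v) {
       if (upto + Long.BYTES > buffer.length) {
         flush();
       }
       for (int i = 0; i < Long.BYTES; i++) {
         buffer[upto++] = (byte) (v >>> (8 * i)); // little-endian byte order
       }
     }

     void flush() {
       crc.update(buffer, 0, upto); // one bulk CRC update instead of many
       upto = 0;
     }

     long getValue() {
       flush();
       return crc.getValue();
     }
   }
   ```

   The bulk `CRC32.update(byte[], int, int)` call is what makes this cheaper than eight single-byte updates per long.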
   








Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-12 Thread via GitHub


msfroh commented on PR #13987:
URL: https://github.com/apache/lucene/pull/13987#issuecomment-2471331326

   Luckily, this is a Lucene 10-only bug (from when `docId()` was removed from 
`Scorable`). 
   
   I came across it when updating OpenSearch to support Lucene 10 and needed to 
refactor some code that used `ScoreCachingWrapperScorer`.





[PR] Add a Multi-Vector Similarity Function [lucene]

2024-11-12 Thread via GitHub


vigyasharma opened a new pull request, #13991:
URL: https://github.com/apache/lucene/pull/13991

   This is a small first change towards adding support for multi-vectors. We 
start with adding a `MultiVectorSimilarityFunction` that can handle (late) 
interaction across multiple vector values.
   
   This is the first of a series of splits from the larger prototype PR #13525.





Re: [PR] Add a Multi-Vector Similarity Function [lucene]

2024-11-12 Thread via GitHub


vigyasharma commented on PR #13991:
URL: https://github.com/apache/lucene/pull/13991#issuecomment-2472019578

   I am thinking we can leverage the `NONE` aggregation (in #13525) for non-ColBERT passage vector use cases, by making each graph node correspond to a single value in the multi-vector, i.e. index-time aggregation becomes "none". The resulting graph would be similar to what we construct with parent-child docs today, while flat storage with multi-vectors could allow for aggregated similarity checks at *query* time. This could help with recall while making multi-vector usage easier (no overquery or index-time joins).
   
   It'll need some work: a mechanism to address each vector value directly, and 
corresponding changes in VectorValues. I'm thinking maybe an "ordinal" for the 
multi-vector, and a "sub-ordinal" for values within the multi-vector. Both ints 
can be packed into a long for node value?
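   The ordinal/sub-ordinal packing floated above could look something like this (a hypothetical sketch, not a committed API; all names are made up):

   ```java
   // Hypothetical sketch: pack a multi-vector "ordinal" and a "sub-ordinal"
   // into one long graph-node value, and unpack them again. Not a Lucene API.
   final class PackedNode {
     static long pack(int ord, int subOrd) {
       return (((long) ord) << 32) | (subOrd & 0xFFFF_FFFFL);
     }

     static int ord(long node) {
       return (int) (node >>> 32); // high 32 bits
     }

     static int subOrd(long node) {
       return (int) node; // low 32 bits
     }
   }
   ```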
   
   Since I haven't chalked out all the details yet, I decided to **remove** the 
`NONE` aggregation for now, and focus first on the ColBERT use case. Will go 
with the `isMultiVector` flag in FieldInfos to identify multi-vector storage 
requirements.
   
   cc: @cpoerschke , @benwtrent 





Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-12 Thread via GitHub


msfroh commented on PR #13987:
URL: https://github.com/apache/lucene/pull/13987#issuecomment-2472044035

   I think the caching worked fine, albeit in a funny way. 
   
   When you call `collect()` with any valid doc ID, it invalidates the cache, causing the score to be computed. After the score is computed, `currDoc` is set to `-1`. Repeated calls to `score()` (until the next `collect()`) then reuse the cached value because `-1 == -1`. 
   
   As long as you don't call `collect()` on the same document multiple times 
(which you probably shouldn't be doing anyway -- those existing unit tests were 
a little weird), it should be okay.
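   The intended caching pattern - recompute only when the doc changes - can be sketched generically (illustrative only; this is not the actual `ScoreCachingWrappingScorer` code, and the names are made up):

   ```java
   // Generic sketch of doc-keyed score caching: the score is recomputed only
   // when a new doc is seen; repeat calls for the same doc hit the cache.
   // The bug discussed above amounts to never updating the cached-doc key.
   class ScoreCacheSketch {
     interface ScoreSource {
       float score(int doc);
     }

     private final ScoreSource in;
     private int cachedDoc = -1;
     private float cachedScore;
     int computations; // instrumentation so the sketch's behavior is observable

     ScoreCacheSketch(ScoreSource in) {
       this.in = in;
     }

     float score(int doc) {
       if (doc != cachedDoc) {
         cachedScore = in.score(doc);
         cachedDoc = doc; // remember which doc the cached score belongs to
         computations++;
       }
       return cachedScore;
     }
   }
   ```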





Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub


benchaplin commented on code in PR #13984:
URL: https://github.com/apache/lucene/pull/13984#discussion_r1838568093


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors(
 return status;
   }
 
+  private static HnswGraph getHnswGraph(CodecReader reader) throws IOException {
+    KnnVectorsReader vectorsReader = reader.getVectorReader();
+    if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
+      vectorsReader = ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader).getFieldReader("knn");
+      if (vectorsReader instanceof HnswGraphProvider) {
+        return ((HnswGraphProvider) vectorsReader).getGraph("knn");
+      }
+    }
+    return null;
+  }
+
+  /** Test the HNSW graph. */
+  public static Status.HnswGraphStatus testHnswGraph(
+      CodecReader reader, PrintStream infoStream, boolean failFast) throws IOException {
+    if (infoStream != null) {
+      infoStream.print("test: hnsw graph..");
+    }
+    long startNS = System.nanoTime();
+    Status.HnswGraphStatus status = new Status.HnswGraphStatus();
+
+    try {
+      HnswGraph hnswGraph = getHnswGraph(reader);
+      if (hnswGraph != null) {
+        final int numLevels = hnswGraph.numLevels();
+        // Perform tests on each level of the HNSW graph
+        for (int level = numLevels - 1; level >= 0; level--) {
+          HnswGraph.NodesIterator nodesIterator = hnswGraph.getNodesOnLevel(level);
+          while (nodesIterator.hasNext()) {
+            int node = nodesIterator.nextInt();
+            hnswGraph.seek(level, node);
+            int nbr, lastNeighbor = -1, firstNeighbor = -1;
+            while ((nbr = hnswGraph.nextNeighbor()) != NO_MORE_DOCS) {
+              if (firstNeighbor == -1) {
+                firstNeighbor = nbr;
+              }
+              if (nbr < lastNeighbor) {
+                throw new CheckIndexException(
+                    "Neighbors out of order for node "
+                        + node
+                        + ": "
+                        + nbr
+                        + "<"
+                        + lastNeighbor
+                        + " 1st="
+                        + firstNeighbor);
+              } else if (nbr == lastNeighbor) {
+                throw new CheckIndexException(
+                    "There are repeated neighbors of node " + node + " with value " + nbr);
+              }
+              lastNeighbor = nbr;
+            }
+            status.hnswGraphSize++;
+          }
+          status.hsnwGraphNumLevels++;

Review Comment:
   I could, but I wanted to respect the object as a "status" - so if a check fails, `numLevels` will indicate on which level it failed. However, we don't print the status if an exception is thrown. What do you think? 






Re: [PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]

2024-11-12 Thread via GitHub


uschindler commented on PR #13985:
URL: https://github.com/apache/lucene/pull/13985#issuecomment-2471068437

   Hi,
   I am currently on travel, so I can't review this. Will look into it possibly later this week. Greetings from Costa Rica!





Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub


viswanathk commented on code in PR #13990:
URL: https://github.com/apache/lucene/pull/13990#discussion_r1838356398


##
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:
##
@@ -154,7 +154,14 @@ protected TopDocs approximateSearch(
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + query[0] + ",...][" + k + "]";

Review Comment:
   @benwtrent for the Diversifying* does it make sense to capture the 
`parentsFilter` in toString() as well?






Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-11-12 Thread via GitHub


vsop-479 commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2472163944

   Hello @jpountz, please take a look when you get a chance.





Re: [PR] Prevent DefaultPassageFormatter from taking shorter overlapping passages [lucene]

2024-11-12 Thread via GitHub


dsmiley commented on code in PR #13384:
URL: https://github.com/apache/lucene/pull/13384#discussion_r1839282216


##
lucene/CHANGES.txt:
##
@@ -280,6 +280,8 @@ Optimizations
 Bug Fixes
 -
 
+* GITHUB#13384: Fix highlighter to use longer passages instead of shorter individual terms. (Zack Kendall)

Review Comment:
   Next time please indicate *which* highlighter; we have 3-4!






Re: [PR] Allow easier verification of the Panama Vectorization provider with newer Java versions [lucene]

2024-11-12 Thread via GitHub


ChrisHegarty commented on PR #13986:
URL: https://github.com/apache/lucene/pull/13986#issuecomment-2471543719

   > Personally I would prefer a less if/else/default handling using Optional 
like done in the previous sysprops.
   
   I'll make that change before merging.
   
   > Greetings from Costa Rica!
   
   Enjoy!!!





Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub


jpountz commented on code in PR #13989:
URL: https://github.com/apache/lucene/pull/13989#discussion_r1839685981


##
lucene/core/src/java/org/apache/lucene/store/BufferedChecksum.java:
##
@@ -60,6 +64,37 @@ public void update(byte[] b, int off, int len) {
 }
   }
 
+  void updateShort(short val) {
+    if (upto + Short.BYTES > buffer.length) flush();
+    BitUtil.VH_LE_SHORT.set(buffer, upto, val);
+    upto += Short.BYTES;
+  }
+
+  void updateInt(int val) {
+    if (upto + Integer.BYTES > buffer.length) flush();
+    BitUtil.VH_LE_INT.set(buffer, upto, val);
+    upto += Integer.BYTES;
+  }
+
+  void updateLong(long val) {
+    if (upto + Long.BYTES > buffer.length) flush();
+    BitUtil.VH_LE_LONG.set(buffer, upto, val);
+    upto += Long.BYTES;
+  }
+
+  void updateLongs(long[] vals, int offset, int len) {
+    flush();

Review Comment:
   All other `updateXXX()` methods only flush if there is no room left or if 
the data to update is bigger than the buffer size. I'd like to retain this 
property here (even though your benchmarks suggest that it doesn't hurt 
performance much).
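   The property described here - flush only when the incoming data doesn't fit - could be kept with a shape like the following (a sketch under assumed buffer semantics, not the PR's actual code; longs are staged byte-by-byte for brevity where the real code would use a VarHandle):

   ```java
   import java.util.zip.CRC32;

   // Sketch of an updateLongs() that flushes only when the payload would not
   // fit in the remaining buffer space, matching the other updateXXX() methods.
   // Illustrative names; not Lucene's BufferedChecksum.
   class ChecksumBufferSketch {
     private final CRC32 crc = new CRC32();
     private final byte[] buffer = new byte[64];
     private int upto;

     void updateLongs(long[] vals, int offset, int len) {
       if (upto + len * Long.BYTES > buffer.length) {
         flush(); // only flush when the payload would overflow what's buffered
       }
       for (int i = offset; i < offset + len; i++) {
         if (upto + Long.BYTES > buffer.length) {
           flush(); // payload bigger than the whole buffer: drain as we go
         }
         for (int b = 0; b < Long.BYTES; b++) {
           buffer[upto++] = (byte) (vals[i] >>> (8 * b)); // little-endian
         }
       }
     }

     void flush() {
       crc.update(buffer, 0, upto);
       upto = 0;
     }

     long getValue() {
       flush();
       return crc.getValue();
     }
   }
   ```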






Re: [PR] DocValuesSkipper implementation in IndexSortSorted [lucene]

2024-11-12 Thread via GitHub


gsmiller commented on code in PR #13886:
URL: https://github.com/apache/lucene/pull/13886#discussion_r1838528845


##
lucene/core/src/java/org/apache/lucene/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##
@@ -397,106 +413,80 @@ private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] quer
   }
 
   private IteratorAndCount getDocIdSetIteratorOrNullFromBkd(
-      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
-    Sort indexSort = context.reader().getMetaData().sort();
+      LeafReader reader, NumericDocValues numericDocValues, DocValuesSkipper skipper)
+      throws IOException {
+    if (skipper.docCount() != reader.maxDoc()) {
+      return null;
+    }
+    final Sort indexSort = reader.getMetaData().sort();
     if (indexSort == null
         || indexSort.getSort().length == 0
         || indexSort.getSort()[0].getField().equals(field) == false) {
       return null;
     }
 
+    final int minDocID;
+    final int maxDocID;
     final boolean reverse = indexSort.getSort()[0].getReverse();
-
-    PointValues points = context.reader().getPointValues(field);

Review Comment:
   Yeah, I'm not really sure to be honest. Ideally we would measure/benchmark 
this. I'm OK with either option as a starting point though, as long as we're 
still utilizing these non-skipper options when a skipper doesn't exist.






Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub


benchaplin commented on code in PR #13984:
URL: https://github.com/apache/lucene/pull/13984#discussion_r1838509594


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -406,6 +411,21 @@ public static final class VectorValuesStatus {
       public Throwable error;
     }
 
+    /** Status from testing HNSW graph */
+    public static final class HnswGraphStatus {
+
+      HnswGraphStatus() {}
+
+      /** Number of nodes in the graph */
+      public int hnswGraphSize;

Review Comment:
   Good point - renamed it `totalNumNodes` to be extra clear.






Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub


jfboeuf commented on PR #13989:
URL: https://github.com/apache/lucene/pull/13989#issuecomment-2471157022

   @jpountz 
   [I modified the benchmark to make it more realistic by adding a header to the `IndexOutput`](https://github.com/apache/lucene/commit/8dc6eac23b3a1158ef4c82860d8574c779bad047) so:
   * `updateLongs(long[], int, int)` triggers a real flush before processing (the benchmark was biased because the buffer was empty, which is unlikely in real operations, where at least the header and the table size have already been written).
   * write accesses to the buffer by `updateLong(long)` are not necessarily 
aligned (they are always with the new `updateLongs(long[], int, int)`).
   
   After testing with different sizes of `long[]` it seems the loop of 
`updateLong(long)` is only worth it for **really small** arrays and this 
specific case probably doesn't deserve the additional complexity/if_branch in 
`updateLongs(long[], int, int)`.
   ```
   Benchmark                                              (size)   Mode  Cnt     Score     Error   Units
   BufferedChecksumIndexInputBenchmark.decodeLongArray         8  thrpt   15  5992.000 ± 123.429  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeLongArray        16  thrpt   15  5729.068 ± 123.423  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeLongArray        32  thrpt   15  5307.703 ± 149.885  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeLongArray        64  thrpt   15  4960.178 ± 124.810  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongs       8  thrpt   15  6182.986 ± 253.181  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongs      16  thrpt   15  5472.158 ± 190.560  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongs      32  thrpt   15  4739.634 ± 404.963  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongs      64  thrpt   15  3858.685 ±  81.880  ops/ms
   ```
   @rmuir I'll add unit tests to thoroughly check the method behaves properly.





Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub


benchaplin commented on code in PR #13984:
URL: https://github.com/apache/lucene/pull/13984#discussion_r1838595160


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors(
 return status;
   }
 
+  private static HnswGraph getHnswGraph(CodecReader reader) throws IOException {
+    KnnVectorsReader vectorsReader = reader.getVectorReader();
+    if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
+      vectorsReader = ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader).getFieldReader("knn");

Review Comment:
   Thanks, I didn't quite understand fields when I wrote this - I think I get 
it now. Alright, I've done what you suggested (as is also done in 
`testVectors`) and iterated over `FieldInfos`, performing the check only when 
it applies. 
   
   Because we might now parse several HNSW graphs, I've restructured the status 
object to support per-graph data. Successful output will now look like:
   
   ```
   test: open reader.OK [took 0.010 sec]
   test: check integrity.OK [took 2.216 sec]
   test: check live docs.OK [took 0.000 sec]
   test: field infos.OK [2 fields] [took 0.000 sec]
   test: field norms.OK [0 fields] [took 0.000 sec]
   test: terms, freq, prox...test: stored fields...OK [150 total field count; avg 1.0 fields per doc] [took 0.390 sec]
   test: term vectorsOK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
   test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET; 0 SKIPPING INDEX] [took 0.000 sec]
   test: points..OK [0 fields, 0 points] [took 0.000 sec]
   test: vectors.OK [1 fields, 150 vectors] [took 0.496 sec]
   test: hnsw graphs.OK [2 fields: (field name: knn1, levels: 4, total nodes: 1547684), (field name: knn2, levels: 4, total nodes: 1547684)] [took 0.979 sec]
   ```
   
   `testVectors` doesn't do this, it just sums vectors over all fields. I could 
do that too, but this felt most complete.






Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub


rmuir commented on PR #13989:
URL: https://github.com/apache/lucene/pull/13989#issuecomment-2471135059

   OK, I see @jfboeuf, thank you for the explanation. My only concern with the optimization is testing. If there is a bug here, the user will get a CorruptIndexException.
   
   Could we add some low-level tests somehow? 
   
   Especially the readLongs() seems like potential trouble, as I am not sure 
anything tests that it works correctly with a non-zero `offset`.
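   A low-level reference test of the kind suggested here might look like the following (a sketch only, with illustrative names, not a Lucene test class): checksum a slice of a `long[]` two ways and verify the CRCs agree for non-zero offsets.

   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import java.util.zip.CRC32;

   // Sketch of a reference check for non-zero offsets: checksum a slice of a
   // long[] via bulk little-endian encoding and via a strictly byte-by-byte
   // reference, then compare the two CRCs.
   final class OffsetCrcCheck {
     static long bulkCrc(long[] vals, int offset, int len) {
       ByteBuffer bb = ByteBuffer.allocate(len * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
       for (int i = offset; i < offset + len; i++) {
         bb.putLong(vals[i]);
       }
       CRC32 crc = new CRC32();
       crc.update(bb.array(), 0, len * Long.BYTES);
       return crc.getValue();
     }

     static long perByteCrc(long[] vals, int offset, int len) {
       CRC32 crc = new CRC32();
       for (int i = offset; i < offset + len; i++) {
         for (int b = 0; b < Long.BYTES; b++) {
           crc.update((int) (vals[i] >>> (8 * b)) & 0xFF);
         }
       }
       return crc.getValue();
     }
   }
   ```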
   





Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub


viswanathk commented on PR #13990:
URL: https://github.com/apache/lucene/pull/13990#issuecomment-2470836662

   I hope I got all of them now.





Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub


benwtrent commented on code in PR #13990:
URL: https://github.com/apache/lucene/pull/13990#discussion_r1838315863


##
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:
##
@@ -154,7 +154,14 @@ protected TopDocs approximateSearch(
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + query[0] + ",...][" + k + "]";

Review Comment:
   test updates for this and the float version? The tests are in: 
TestParentBlockJoinByteKnnVectorQuery and of course the float version as well.






Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]

2024-11-12 Thread via GitHub


mikemccand commented on PR #13651:
URL: https://github.com/apache/lucene/pull/13651#issuecomment-2470394946

   > @ShashwatShivam I don't think there is a "memory column" provided 
anywhere. I simply looked at the individual file sizes (veb, vex) and summed 
their sizes together.
   
   Once this cool change is merged let's fix `luceneutil`'s KNN benchy tooling 
(`knnPerfTest.py`, `KnnGraphTester.java`) to compute/report the "memory column" 
("hot RAM", "searchable RAM", something)?  Basically everything except the 
original (`float32` or `byte`) vectors.  I'll open an upstream `luceneutil` 
issue...





Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]

2024-11-12 Thread via GitHub


benwtrent commented on PR #13651:
URL: https://github.com/apache/lucene/pull/13651#issuecomment-2470503074

   Quick update: we have been bothered by some of the numbers (for example, models like "gist" perform poorly) and have some improvements to get done first before flipping back to "ready for review". 
   
   
   @mikemccand YES! That would be great! "Memory required" would be the 
quantized file size + hnsw graph file size (if the graph exists).
   
   
   
   @ShashwatShivam 
   
   Sorry for the late reply. There are no "out of the box" rescoring actions directly in Lucene, mainly because the individual tools are (mostly) already available to you. You can ask for more overall vectors with one query, and then rescore the individual documents according to the raw vector comparisons. I admit, this requires some Lucene API know-how.
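   The oversample-then-rescore pattern described here can be sketched in plain Java (generic code, not a Lucene API; assume `candidates` came from an approximate search that asked for more than `k` hits):

   ```java
   import java.util.Arrays;
   import java.util.Comparator;

   // Generic sketch: take candidate doc ids from an over-sized approximate
   // (e.g. quantized) search, recompute exact similarities against the raw
   // vectors, and keep the top k. Illustrative only.
   final class RescoreSketch {
     static float dot(float[] a, float[] b) {
       float s = 0f;
       for (int i = 0; i < a.length; i++) {
         s += a[i] * b[i];
       }
       return s;
     }

     static int[] rescoreTopK(int[] candidates, float[][] rawVectors, float[] query, int k) {
       Integer[] ids = Arrays.stream(candidates).boxed().toArray(Integer[]::new);
       // sort candidates by exact (raw) similarity, descending
       Arrays.sort(ids, Comparator.comparingDouble((Integer d) -> dot(rawVectors[d], query)).reversed());
       int[] top = new int[Math.min(k, ids.length)];
       for (int i = 0; i < top.length; i++) {
         top[i] = ids[i];
       }
       return top;
     }
   }
   ```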
   
   
   It would be good for a "vector scorer" to indicate if it's an estimation or not, to allow for smarter actions in the knn doc collector...





Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub


jpountz commented on PR #13990:
URL: https://github.com/apache/lucene/pull/13990#issuecomment-2470768672

   I believe that @benwtrent meant `KnnByteVectorQuery`.





Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub


mikemccand commented on code in PR #13984:
URL: https://github.com/apache/lucene/pull/13984#discussion_r1837944575


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -406,6 +411,21 @@ public static final class VectorValuesStatus {
   public Throwable error;
 }
 
+/** Status from testing HNSW graph */
+public static final class HnswGraphStatus {
+
+  HnswGraphStatus() {}
+
+  /** Number of nodes in the graph */
+  public int hnswGraphSize;
+
+  /** Number of levels of the graph */
+  public int hsnwGraphNumLevels;

Review Comment:
   `graphNumLevels`?
   
   Naming is the hardest part!



##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -406,6 +411,21 @@ public static final class VectorValuesStatus {
   public Throwable error;
 }
 
+/** Status from testing HNSW graph */
+public static final class HnswGraphStatus {
+
+  HnswGraphStatus() {}
+
+  /** Number of nodes in the graph */
+  public int hnswGraphSize;

Review Comment:
   Maybe rename to `graphNumNodes` or so?  `size` is a bit ambiguous/tricky, 
and since `hnsw` is in the class name we can prolly leave it off the member 
name?



##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors(
 return status;
   }
 
+  private static HnswGraph getHnswGraph(CodecReader reader) throws IOException {
+    KnnVectorsReader vectorsReader = reader.getVectorReader();
+    if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
+      vectorsReader =
+          ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader).getFieldReader("knn");

Review Comment:
   Hmm the field can be anything (not necessarily `knn` like `luceneutil` 
uses)?  Can we change this so we iterate `FieldInfos` and for any field that 
has KNN vectors enabled (`FieldInfo.hasVectorValues`, hmm but also has the HNSW 
graph (since I think a field can be only "flat" vectors?)) we do this check?



##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -2746,6 +2769,84 @@ public static Status.VectorValuesStatus testVectors(
 return status;
   }
 
+  private static HnswGraph getHnswGraph(CodecReader reader) throws IOException {
+    KnnVectorsReader vectorsReader = reader.getVectorReader();
+    if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
+      vectorsReader =
+          ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader).getFieldReader("knn");
+      if (vectorsReader instanceof HnswGraphProvider) {
+        return ((HnswGraphProvider) vectorsReader).getGraph("knn");
+      }
+    }
+    return null;
+  }
+
+  /** Test the HNSW graph. */
+  public static Status.HnswGraphStatus testHnswGraph(
+      CodecReader reader, PrintStream infoStream, boolean failFast) throws IOException {
+    if (infoStream != null) {
+      infoStream.print("test: hnsw graph..");
+    }
+    long startNS = System.nanoTime();
+    Status.HnswGraphStatus status = new Status.HnswGraphStatus();
+
+    try {
+      HnswGraph hnswGraph = getHnswGraph(reader);
+      if (hnswGraph != null) {
+        final int numLevels = hnswGraph.numLevels();
+        // Perform tests on each level of the HNSW graph
+        for (int level = numLevels - 1; level >= 0; level--) {
+          HnswGraph.NodesIterator nodesIterator = hnswGraph.getNodesOnLevel(level);
+          while (nodesIterator.hasNext()) {
+            int node = nodesIterator.nextInt();
+            hnswGraph.seek(level, node);
+            int nbr, lastNeighbor = -1, firstNeighbor = -1;
+            while ((nbr = hnswGraph.nextNeighbor()) != NO_MORE_DOCS) {
+              if (firstNeighbor == -1) {
+                firstNeighbor = nbr;
+              }
+              if (nbr < lastNeighbor) {
+                throw new CheckIndexException(
+                    "Neighbors out of order for node "
+                        + node
+                        + ": "
+                        + nbr
+                        + "<"
+                        + lastNeighbor
+                        + " 1st="
+                        + firstNeighbor);
+              } else if (nbr == lastNeighbor) {
+                throw new CheckIndexException(
+                    "There are repeated neighbors of node " + node + " with value " + nbr);
+              }
+              lastNeighbor = nbr;
+            }
+            status.hnswGraphSize++;
+          }
+          status.hsnwGraphNumLevels++;

Review Comment:
   Instead of incrementing here, you could just set it to `numLevels` from 
above?




Re: [PR] Simplify codec setup in vector-related tests. [lucene]

2024-11-12 Thread via GitHub


jpountz merged PR #13970:
URL: https://github.com/apache/lucene/pull/13970





Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub


rmuir commented on PR #13989:
URL: https://github.com/apache/lucene/pull/13989#issuecomment-2470068010

   This is actually slower, we only want to call `updateBytes(byte[])` or the 
checksum calculation is very slow (not vectorized).





Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub


jpountz commented on PR #13989:
URL: https://github.com/apache/lucene/pull/13989#issuecomment-2470814754

   The change makes sense to me and looks like it could speed up loading live 
docs.
   
   > The benchmark shows the single-long approach performs better on small 
arrays. [...] It can be improved by making updateLongs(long[], int, int) switch 
to a loop over updateLong(long) when the length to checksum fits in the buffer.
   
   Something like this would make sense to me, let's make it look as similar as 
possible to `update(byte[], int, int)`?
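   As a non-authoritative sketch of that shape (class and method names here are 
illustrative, not Lucene's actual `BufferedChecksum` internals): small runs go 
through the buffer, large runs flush it and feed one bulk 
`update(byte[], int, int)`, so the wrapped `CRC32` only ever sees byte arrays 
(the path it vectorizes):

   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import java.util.zip.CRC32;
   import java.util.zip.Checksum;

   // Illustrative sketch only: batch long writes into a byte buffer so the
   // wrapped Checksum is always fed whole byte arrays.
   class LongBufferingChecksum {
     private final Checksum in = new CRC32();
     private final ByteBuffer buffer =
         ByteBuffer.allocate(1024).order(ByteOrder.LITTLE_ENDIAN);

     void updateLong(long v) {
       if (buffer.remaining() < Long.BYTES) {
         flush();
       }
       buffer.putLong(v);
     }

     // Mirrors the shape of update(byte[], int, int): small runs loop over
     // updateLong(long), large runs bypass the buffer after a flush.
     void updateLongs(long[] vals, int off, int len) {
       if ((long) len * Long.BYTES <= buffer.capacity()) {
         for (int i = off; i < off + len; i++) {
           updateLong(vals[i]);
         }
       } else {
         flush();
         ByteBuffer tmp = ByteBuffer.allocate(len * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
         tmp.asLongBuffer().put(vals, off, len);
         in.update(tmp.array(), 0, len * Long.BYTES);
       }
     }

     long getValue() {
       flush();
       return in.getValue();
     }

     private void flush() {
       in.update(buffer.array(), 0, buffer.position());
       buffer.clear();
     }
   }
   ```

   Either path produces the same byte stream, so the checksum value is identical; 
only the number of `update` calls differs.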





Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub


viswanathk commented on code in PR #13990:
URL: https://github.com/apache/lucene/pull/13990#discussion_r1838322211


##
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:
##
@@ -154,7 +154,14 @@ protected TopDocs approximateSearch(
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + query[0] + ",...][" + k + "]";

Review Comment:
   yeah, i did check them. they didn't have existing tests that cover 
`toString` so skipped them. will add that coverage though.






Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub


viswanathk commented on code in PR #13990:
URL: https://github.com/apache/lucene/pull/13990#discussion_r1838356398


##
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:
##
@@ -154,7 +154,14 @@ protected TopDocs approximateSearch(
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + query[0] + ",...][" + k + "]";

Review Comment:
   @benwtrent for the Diversifying* does it make sense to capture the 
`childFilter` and `parentsFilter` in toString() as well?






Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-12 Thread via GitHub


mikemccand commented on code in PR #13987:
URL: https://github.com/apache/lucene/pull/13987#discussion_r1837971840


##
lucene/core/src/test/org/apache/lucene/search/TestTopFieldCollector.java:
##
@@ -359,7 +359,7 @@ public void testTotalHitsWithScore() throws Exception {
   leafCollector.collect(1);
 
   scorer.score = 4;
-  leafCollector.collect(1);
+  leafCollector.collect(2);

Review Comment:
   Another pre-existing test bug fixed?



##
lucene/core/src/test/org/apache/lucene/search/TestPositiveScoresOnlyCollector.java:
##
@@ -114,8 +114,9 @@ public void testNegativeScores() throws Exception {
 Collector c = new PositiveScoresOnlyCollector(tdc);
 LeafCollector ac = c.getLeafCollector(ir.leaves().get(0));
 ac.setScorer(s);
-while (s.iterator().nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
-  ac.collect(0);
+int docId;
+while ((docId = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+  ac.collect(docId);

Review Comment:
   Wow, good catch!  So this was a latent pre-existing test bug, and with your 
above bug fix this test is now (correctly!) failing, and with your fix to this 
separate test bug, the test now passes?






Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-11-12 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2470112408

   On some more analysis, I figured out an approach that addresses all the above 
comments and obtains the same improvement with a different IndexWriter per 
group as we got with using different DWPTs per group.
   
   ## Using separate IndexWriter for maintaining different tenants with a 
combined view
   
   ### Current Issue
   
   Maintaining separate IndexWriters for different groups (tenants) presents a 
significant problem, as they do not function as a single unified entity. 
Although distinct IndexWriters and directories for each group ensure that data 
belonging to different groups is kept in separate segments and segments within 
the same group are merged, a unified read-only view for the client (OpenSearch) 
to interact with these multiple group-level IndexWriters is still needed.
   
   Lucene’s addIndexes API offers a way to combine group-level IndexWriters 
into a single parent-level IndexWriter, but this approach has multiple drawbacks:
   
   1. Since writes may continue on group-level IndexWriters, periodic 
synchronisation with the parent-level IndexWriter is necessary.
   2. During synchronisation, an external lock needs to be placed on the group 
level IndexWriter directory, causing downtime.
   3. Synchronisation will also involve copying files from the group level 
IndexWriter directory to the parent IndexWriter directory, which is 
resource-intensive, consuming disk IO and CPU cycles.
   
   ### Proposal
   
   To address this issue, we propose introducing a mechanism that combines 
group-level IndexWriters as soft references under a parent IndexWriter. This will 
be achieved by creating a new variant of the addIndexes API within IndexWriter, 
which will only combine the SegmentInfos of the group-level IndexWriters without 
requiring an external lock or copying files across directories. Group-level 
segments will be maintained in separate directories associated with their 
respective group-level IndexWriters.
   
   The client will periodically call this addIndexes API on the parent 
IndexWriter (on the OpenSearch side this corresponds to the index refresh 
interval of 1 second), passing the SegmentInfos of the child-level IndexWriters 
as parameters to sync the latest SegmentInfos with the parent IndexWriter. While 
combining the SegmentInfos of the child-level IndexWriters, the addIndexes API 
will attach a prefix to the segment names to identify the group each segment 
belongs to, avoiding name conflicts between segments of different group-level 
IndexWriters.
   
   ![compositeIndexWriter drawio 
(1)](https://github.com/user-attachments/assets/8ddd8568-a352-41ac-bc42-ce3cb4647f8f)
   
   The parent IndexWriter will be associated with a filter directory that 
distinguishes the tenant using the file name prefix, redirecting any read/write 
operation on a file to the correct group-level directory based on the segment 
file name prefix.
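   As a rough sketch of that routing idea (the `$` prefix syntax and all names 
here are made up for illustration, not part of the proposal's actual code), the 
filter directory's name resolution could look like:

   ```java
   import java.nio.file.Path;
   import java.util.Map;

   // Hypothetical sketch: the parent view resolves a prefixed segment file name
   // like "group1$_0.cfs" to the group-level directory that owns it.
   class GroupRoutingResolver {
     private final Map<String, Path> groupDirs;

     GroupRoutingResolver(Map<String, Path> groupDirs) {
       this.groupDirs = groupDirs;
     }

     Path resolve(String segmentFileName) {
       int sep = segmentFileName.indexOf('$');
       if (sep < 0) {
         throw new IllegalArgumentException("no group prefix: " + segmentFileName);
       }
       String group = segmentFileName.substring(0, sep);
       Path dir = groupDirs.get(group);
       if (dir == null) {
         throw new IllegalArgumentException("unknown group: " + group);
       }
       // Strip the prefix and hand the real file name to the group directory.
       return dir.resolve(segmentFileName.substring(sep + 1));
     }
   }
   ```

   A real implementation would wrap Lucene's `FilterDirectory` and route 
`openInput`/`createOutput` calls the same way.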
   
   #### Reason for choosing a common view as an IndexWriter
   
   Most interactions of Lucene with the client (OpenSearch), such as opening a 
reader, getting the latest commit info, reopening a Lucene index, etc., occur 
via the IndexWriter itself. Thus selecting IndexWriter as the common view made 
more sense.
   
   ### Improvements with multiple IndexWriter with a combined view
   
   We were able to observe around 50%-60% improvement with the multiple 
IndexWriters with a combined view approach, similar to what we observed by 
having different DWPTs for different tenants (the initial proposal).
   
   ### Considerations
   
   1. The referencing IndexWriter will be a combined read-only view over the 
group-level IndexWriters. Since this IndexWriter does not itself have any 
segments and only references the SegmentInfos of other IndexWriters, write 
operations like segment merges, flushes, etc. should not be performed on this 
parent IndexWriter instance.
   2. We need to account for the prefix attached to segment names when 
[parsing segment 
names](https://github.com/RS146BIJAY/lucene/blob/84811e974f38181b0c1f1e1b5655f674a1584385/lucene/core/src/java/org/apache/lucene/index/IndexFileNames.java#L119).
   3. It will be difficult to support update queries with the multi-IndexWriter 
approach. For example, if we are grouping logs by status code and the user 
updates the status code field of a log, the insert and update operations need 
to be performed on separate delete queues.
   
   



Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]

2024-11-12 Thread via GitHub


ShashwatShivam commented on PR #13651:
URL: https://github.com/apache/lucene/pull/13651#issuecomment-2470937060

   I conducted a benchmark using Cohere's 768-dimensional data. Here are the 
steps I followed for reproducibility:
   
   1. **Set up** the [luceneutil 
repository](https://github.com/mikemccand/luceneutil/) following the 
installation instructions provided.
   2. **Switch branches** to [this specific 
branch](https://github.com/mikemccand/luceneutil/compare/main...benwtrent:luceneutil:bbq)
 since the latest mainline branch is not compatible with the feature needed for 
this experiment.
   3. **Change the branch** of `lucene_candidate` to 
[benwtrent:feature/adv-binarization-format](https://github.com/benwtrent/lucene/tree/feature/adv-binarization-format)
 to incorporate advanced binarization formats.
   4. **Run** `knnPerfTest.py` after specifying the document and query file 
paths to the stored Cohere data files. The runtime parameters were set as 
follows:
  - `nDoc = 500,000`
  - `topk = 10`
  - `fanout = 100`
  - `maxConn = 32`
  - `beamWidth = 100`
  - `oversample` values tested: `{1, 1.5, 2, 3, 4, 5}`
  
  I used `quantizeBits = 1` for RaBitQ+HNSW and `quantizeBits = 32` for 
regular HNSW.
   
   A comparison was performed between HNSW and RaBitQ, and I observed the 
recall-latency tradeoff, which is shown in the attached image:  
   
![output](https://github.com/user-attachments/assets/8f5f8795-8386-422a-8a9a-d7fd9e7051d2).
   
   
   





Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

2024-11-12 Thread via GitHub


ChrisHegarty commented on issue #13551:
URL: https://github.com/apache/lucene/issues/13551#issuecomment-2470361869

   Late to the party!!! 
   
   I want to dig a little further on the distinction between the implementation 
of preload and prefetch, at least with the mmap implementation. The former 
touches every page (effectively forcing into the cache), while the latter 
simply advises of the access pattern.
   
   For sequential access, a combination of sequential advice and prefetch works 
really well, but not so well for random access. Random access, at least for the 
HNSW graph and the related vector data (be it quantised or not), really needs 
to be resident in memory. Preload gives us that, at least at the time of 
opening. Prefetch does not.
   
   For optimized vector search use cases, assuming enough RAM, I was thinking 
that a separate `IndexInput::load` could be useful. This would be similar to 
preload (touches every page), but can be called after opening. It may or may 
not be super helpful, it mostly depends on how things get setup, etc, and 
whether or not preload is good enough for your use case.
   
   For optimized vector search use cases, assuming enough RAM, I was thinking 
that a separate `IndexInput::lock` could be useful. Basically a wrapper around 
`mlock`. The idea being that Vector search perf sucks when there is not enough 
RAM to keep the graph and vector data resident in memory, so allow to load and 
lock it, then leave the tradeoff to the user to decide if they want this "fail 
fast" rather than "search slowly" behaviour (at least for pure vector search).
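   For what it's worth, the "touch every page after opening" behaviour can be 
sketched with plain JDK APIs via `MappedByteBuffer.load()`. This is only an 
illustration of the idea, not Lucene's `MMapDirectory` implementation, and the 
method name `mapAndLoad` is made up:

   ```java
   import java.io.IOException;
   import java.nio.MappedByteBuffer;
   import java.nio.channels.FileChannel;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;

   // Sketch: map a file read-only, then force the mapped region into
   // physical memory by touching every page.
   class ForceResident {
     static MappedByteBuffer mapAndLoad(Path vecFile) throws IOException {
       try (FileChannel ch = FileChannel.open(vecFile, StandardOpenOption.READ)) {
         MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
         mapped.load(); // touches every page; the mapping stays valid after the channel closes
         return mapped;
       }
     }
   }
   ```

   An `mlock`-style guarantee would go further: `load()` only populates pages, 
it does not pin them against later eviction.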
   
   





Re: [PR] Improve checksum calculations [lucene]

2024-11-12 Thread via GitHub


jfboeuf commented on PR #13989:
URL: https://github.com/apache/lucene/pull/13989#issuecomment-2470657083

   Thank you for your feedback. Perhaps I misunderstood your point, but the 
implementation I propose only calls `Checksum.update(byte[])`. The change 
resides in how the buffer is fed, avoiding byte-by-byte reading from the 
underlying `IndexInput` and byte-by-byte copying into the buffer. I created and 
pushed [into a different branch a JMH 
benchmark](https://github.com/apache/lucene/commit/91482728ea1633213bc064ed5362be041101d1d5)
 comparing the current implementation to the one I propose. The results show a 
noticeable improvement:
   ```
   Benchmark                                                  (size)   Mode  Cnt     Score     Error   Units
   BufferedChecksumIndexInputBenchmark.decodeLongArray            10  thrpt   15  7521.843 ± 167.756  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeLongArray          1000  thrpt   15  1004.213 ±   3.587  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeLongArray        100000  thrpt   15    10.993 ±   0.102  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeLongArrayOrig        10  thrpt   15  3865.018 ±  38.850  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeLongArrayOrig      1000  thrpt   15    46.381 ±   0.293  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeLongArrayOrig    100000  thrpt   15     0.475 ±   0.022  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongs          10  thrpt   15  8355.638 ±  52.745  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongs        1000  thrpt   15   212.133 ±   4.296  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongs      100000  thrpt   15     2.744 ±   0.021  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongsOrig      10  thrpt   15  3938.857 ±  42.751  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongsOrig    1000  thrpt   15    47.246 ±   0.444  ops/ms
   BufferedChecksumIndexInputBenchmark.decodeSingleLongsOrig  100000  thrpt   15     0.460 ±   0.020  ops/ms
   ``` 
   For large arrays, there can be an improvement of up to 23 times when reading 
long arrays and 6 times when reading single long values. Transitioning from 
reading single long values to long arrays for live documents and Bloom filters 
(bitsets being commonly large in both scenarios) results in even greater 
performance enhancements.
   
   The benchmark shows the single-long approach performs better on small 
arrays. This is likely due to the cost of wrapping the `byte[]` in a 
`ByteBuffer` and `LongBuffer`, which is not paid off when only a few bytes are 
copied. It can be improved by making `updateLongs(long[], int, int)` switch to 
a loop over `updateLong(long)` when the length to checksum fits in the buffer. 
What do you think?
   





Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub


viswanathk commented on PR #13990:
URL: https://github.com/apache/lucene/pull/13990#issuecomment-2470717378

   > Could you update the byte knn query & DiversifyingChildern* knn queries as 
well?
   
   I made the changes for DiversifyingChildren*, but by byte knn do you mean 
`ByteVectorSimilarityQuery`?





Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub


benwtrent commented on code in PR #13990:
URL: https://github.com/apache/lucene/pull/13990#discussion_r1838141983


##
lucene/core/src/test/org/apache/lucene/search/TestKnnFloatVectorQuery.java:
##
@@ -29,13 +29,7 @@
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.KnnFloatVectorField;
 import org.apache.lucene.document.StringField;
-import org.apache.lucene.index.DirectoryReader;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.IndexWriter;
-import org.apache.lucene.index.IndexWriterConfig;
-import org.apache.lucene.index.LeafReaderContext;
-import org.apache.lucene.index.QueryTimeout;
-import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.*;

Review Comment:
   Let's not collapse to `*`



##
lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java:
##
@@ -112,7 +112,14 @@ VectorScorer createVectorScorer(LeafReaderContext context, FieldInfo fi) throws
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + target[0] + ",...][" + k + "]";
+    StringBuilder buffer = new StringBuilder();
+    buffer.append(getClass().getSimpleName() + ":");
+    buffer.append(this.field + "[" + target[0] + ",...]");
+    buffer.append("[" + k + "]");
+    if (this.filter != null) {
+      buffer.append(",filter:[" + this.filter + "]");
+    }

Review Comment:
   ```suggestion
   if (this.filter != null) {
     buffer.append("[" + this.filter + "]");
   }
   ```
   
   I think that would be better.






Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub


mikemccand commented on PR #13984:
URL: https://github.com/apache/lucene/pull/13984#issuecomment-2470306515

   > this might also check for graph connectivity, see #12627
   
   +1 -- this is a tricky thing about these HNSW graphs (that they are not 
necessarily born connected but rather must be stitched up somehow afterwards).
   
   But maybe do this separately?  I'd love to just get this initial coverage in 
first...





Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-12 Thread via GitHub


mikemccand commented on PR #13984:
URL: https://github.com/apache/lucene/pull/13984#issuecomment-2470310487

   Actually, `CheckIndex` does have some coverage for vectors and the KNN graph: 
it confirms it can enumerate all vectors, and also runs some searches on the 
graph if it's not just flat vectors (`checkByte/FloatVectorValues`), so it's not 
completely blind :)





Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-12 Thread via GitHub


mikemccand commented on PR #13987:
URL: https://github.com/apache/lucene/pull/13987#issuecomment-2470331708

   Wow, good catch @msfroh.  Could we maybe add a new test case that explicitly 
confirms that the wrapped `Scorable`'s `score` method is indeed only called 
once even if the outer user calls `.score()` multiple times on the same 
`docID`?  This seems to be the primary purpose of this class, and it was 
failing this, for a very long time!
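   For reference, the contract in question can be shown with a standalone sketch 
(plain Java, not Lucene's actual `ScoreCachingWrappingScorer`): the wrapped 
scorer's `score()` should run at most once per doc, however many times the 
wrapper is asked, which only holds if `lastDoc` is actually updated.

   ```java
   // Hypothetical minimal model of score caching; names are illustrative.
   class CachingScorer {
     interface Scorer {
       int docID();

       float score();
     }

     private final Scorer in;
     private int lastDoc = -1;
     private float cached;

     CachingScorer(Scorer in) {
       this.in = in;
     }

     float score() {
       int doc = in.docID();
       if (doc != lastDoc) {
         cached = in.score();
         lastDoc = doc; // the update this PR fixes: without it, every call re-scores
       }
       return cached;
     }
   }
   ```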





Re: [PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]

2024-11-12 Thread via GitHub


ChrisHegarty commented on PR #13985:
URL: https://github.com/apache/lucene/pull/13985#issuecomment-2470311965

   @shatejas I'm curious how much this actually helps, and I know that you said 
that benchmark results would be posted.
   
   I do like that we can update the ReadAdvice on an index input 👍  What is 
unclear is how much the sequential advice here helps over something like a 
`load` (similar to `MemorySegment::load`) that touches every page. 





[PR] Addresses the issue #13983 by adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-12 Thread via GitHub


viswanathk opened a new pull request, #13990:
URL: https://github.com/apache/lucene/pull/13990

   (no comment)

