Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-13 Thread via GitHub


msokolov commented on code in PR #13984:
URL: https://github.com/apache/lucene/pull/13984#discussion_r1840202966


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -2746,6 +2781,106 @@ public static Status.VectorValuesStatus testVectors(
 return status;
   }
 
+  /** Test the HNSW graph. */
+  public static Status.HnswGraphsStatus testHnswGraphs(
+      CodecReader reader, PrintStream infoStream, boolean failFast) throws IOException {
+    if (infoStream != null) {
+      infoStream.print("test: hnsw graphs.");
+    }
+    long startNS = System.nanoTime();
+    Status.HnswGraphsStatus status = new Status.HnswGraphsStatus();
+    KnnVectorsReader vectorsReader = reader.getVectorReader();
+    FieldInfos fieldInfos = reader.getFieldInfos();
+
+    try {
+      if (fieldInfos.hasVectorValues()) {
+        for (FieldInfo fieldInfo : fieldInfos) {
+          if (fieldInfo.hasVectorValues()) {
+            if (vectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader) {
+              KnnVectorsReader fieldReader =
+                  ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader)
+                      .getFieldReader(fieldInfo.name);
+              if (fieldReader instanceof HnswGraphProvider) {
+                HnswGraph hnswGraph = ((HnswGraphProvider) fieldReader).getGraph(fieldInfo.name);
+                testHnswGraph(hnswGraph, fieldInfo.name, status);
+              }
+            }
+          }
+        }
+      }
+      StringJoiner hnswGraphResultJoiner = new StringJoiner(", ");
+      for (Map.Entry<String, Status.HnswGraphStatus> hnswGraphStatus :
+          status.hnswGraphsStatusByField.entrySet()) {
+        hnswGraphResultJoiner.add(
+            String.format(
+                Locale.ROOT,
+                "(field name: %s, levels: %d, total nodes: %d)",
+                hnswGraphStatus.getKey(),
+                hnswGraphStatus.getValue().numLevels,
+                hnswGraphStatus.getValue().totalNumNodes));
+      }
+      msg(
+          infoStream,
+          String.format(
+              Locale.ROOT,
+              "OK [%d fields%s] [took %.3f sec]",
+              status.hnswGraphsStatusByField.size(),
+              hnswGraphResultJoiner.toString().isEmpty() ? "" : ": " + hnswGraphResultJoiner,
+              nsToSec(System.nanoTime() - startNS)));
+    } catch (Throwable e) {
+      if (failFast) {
+        throw IOUtils.rethrowAlways(e);
+      }
+      msg(infoStream, "ERROR: " + e);
+      status.error = e;
+      if (infoStream != null) {
+        e.printStackTrace(infoStream);
+      }
+    }
+
+    return status;
+  }
+
+  private static void testHnswGraph(
+      HnswGraph hnswGraph, String fieldName, Status.HnswGraphsStatus status)
+      throws IOException, CheckIndexException {
+    if (hnswGraph != null) {
+      status.hnswGraphsStatusByField.put(fieldName, new Status.HnswGraphStatus());
+      final int numLevels = hnswGraph.numLevels();
+      // Perform tests on each level of the HNSW graph
+      for (int level = numLevels - 1; level >= 0; level--) {
+        HnswGraph.NodesIterator nodesIterator = hnswGraph.getNodesOnLevel(level);
+        while (nodesIterator.hasNext()) {
+          int node = nodesIterator.nextInt();
+          hnswGraph.seek(level, node);
+          int nbr, lastNeighbor = -1, firstNeighbor = -1;
+          while ((nbr = hnswGraph.nextNeighbor()) != NO_MORE_DOCS) {
+            if (firstNeighbor == -1) {

Review Comment:
   Perhaps at some point we could also check for uniqueness of the neighbors, 
check that the neighbors are in range [0, numGraphNodes], and finally, on 
levels > 0, assert that the neighbors are on this level. But this can all be 
done separately; maybe we could add a comment here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new Directory implementation for AWS S3 [lucene]

2024-11-13 Thread via GitHub


jpountz commented on PR #13949:
URL: https://github.com/apache/lucene/pull/13949#issuecomment-2472787711

   Since we have an S3 Directory implemented already, let's run a comparison 
with the FUSE mount approach?





Re: [PR] Allow easier verification of the Panama Vectorization provider with newer Java versions [lucene]

2024-11-13 Thread via GitHub


uschindler commented on PR #13986:
URL: https://github.com/apache/lucene/pull/13986#issuecomment-2473792705

   > > > Personally I would prefer a less if/else/default handling using 
Optional like done in the previous sysprops.
   > > 
   > > 
   > > I'll make that change before merging.
   > 
   > Well... I tried, and it was awful! The reason being the need to parseInt 
the value and catch the exception, which, while possible, just led to more 
convoluted code. Easy to refactor later, if at all.
   
   I would have let the code bail out if the sysprop is empty or not a valid 
integer, so there's no need to catch the exception at all.
   
   Anyways, it's good as is.
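For illustration, here is a hypothetical sketch (not the actual Lucene code; method and property names are made up) of the two styles being weighed: catching `NumberFormatException` inside an `Optional` chain versus letting an invalid value fail fast with no catch.

```java
import java.util.Optional;

public class SyspropSketch {
  // Style 1: catch and map an invalid value to Optional.empty()
  // (the approach that turned out convoluted in practice).
  static Optional<Integer> parseLenient(String value) {
    try {
      return Optional.ofNullable(value).map(String::trim).map(Integer::valueOf);
    } catch (NumberFormatException e) {
      return Optional.empty();
    }
  }

  // Style 2: bail out on an invalid value, no catch at all;
  // only a missing/empty value falls back to the default.
  static int parseStrict(String value, int defaultValue) {
    if (value == null || value.isEmpty()) {
      return defaultValue;
    }
    return Integer.parseInt(value); // invalid input propagates as NumberFormatException
  }

  public static void main(String[] args) {
    System.out.println(parseLenient("42").orElse(-1)); // 42
    System.out.println(parseLenient("oops").orElse(-1)); // -1
    System.out.println(parseStrict(null, 7)); // 7
  }
}
```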





Re: [PR] DocValuesSkipper implementation in IndexSortSorted [lucene]

2024-11-13 Thread via GitHub


gsmiller commented on code in PR #13886:
URL: https://github.com/apache/lucene/pull/13886#discussion_r1840714602


##
lucene/core/src/java/org/apache/lucene/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##
@@ -397,106 +413,80 @@ private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] quer
   }
 
   private IteratorAndCount getDocIdSetIteratorOrNullFromBkd(
-      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
-    Sort indexSort = context.reader().getMetaData().sort();
+      LeafReader reader, NumericDocValues numericDocValues, DocValuesSkipper skipper)
+      throws IOException {
+    if (skipper.docCount() != reader.maxDoc()) {
+      return null;
+    }
+    final Sort indexSort = reader.getMetaData().sort();
     if (indexSort == null
         || indexSort.getSort().length == 0
         || indexSort.getSort()[0].getField().equals(field) == false) {
       return null;
     }
 
+    final int minDocID;
+    final int maxDocID;
     final boolean reverse = indexSort.getSort()[0].getReverse();
-
-    PointValues points = context.reader().getPointValues(field);

Review Comment:
   @iverase hmm, I thought I'd responded a few days back but it looks like it 
didn't save for some reason (probably user error). Anyway, I think your take on 
this is very reasonable, but it would be nice to measure it, if possible, to 
quantify this in some way. But yeah, I'm fine with either order as a starting 
point.






[I] Strange javac "automatic module" warning for `benchmark-jmh` [lucene]

2024-11-13 Thread via GitHub


mikemccand opened a new issue, #13992:
URL: https://github.com/apache/lucene/issues/13992

   ### Description
   
   I suddenly noticed this warning when running top-level `./gradlew jar` in 
Lucene `main`:
   
   ```
   /s1/l/trunk/lucene/benchmark-jmh/src/java/module-info.java:20: warning: 
requires directive for an automatic module
 requires jmh.core;
 ^
   1 warning
   ```
   
   I think it is related to Java modules, but I'm not sure how to fix/suppress 
it :)
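For context, the warning comes from javac's `requires-automatic` lint category: `jmh-core` ships without a `module-info`, so it is pulled in as an automatic module. An untested sketch of one possible suppression in the Gradle build (assuming the build wires `compilerArgs` through in the usual way; the task selector is illustrative):

```groovy
// Disable only the "requires-automatic" lint key for this module's compilation.
tasks.withType(JavaCompile).configureEach {
  options.compilerArgs += "-Xlint:-requires-automatic"
}
```

Alternatively, annotating the module declaration itself with `@SuppressWarnings("requires-automatic")` may silence it at the source level.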





Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-13 Thread via GitHub


benchaplin commented on PR #13984:
URL: https://github.com/apache/lucene/pull/13984#issuecomment-2474165424

   Gotcha, thanks for the comments @msokolov! 





Re: [PR] Improve checksum calculations [lucene]

2024-11-13 Thread via GitHub


jfboeuf commented on code in PR #13989:
URL: https://github.com/apache/lucene/pull/13989#discussion_r1840804892


##
lucene/core/src/java/org/apache/lucene/store/BufferedChecksum.java:
##
@@ -60,6 +64,37 @@ public void update(byte[] b, int off, int len) {
 }
   }
 
+  void updateShort(short val) {
+    if (upto + Short.BYTES > buffer.length) flush();
+    BitUtil.VH_LE_SHORT.set(buffer, upto, val);
+    upto += Short.BYTES;
+  }
+
+  void updateInt(int val) {
+    if (upto + Integer.BYTES > buffer.length) flush();
+    BitUtil.VH_LE_INT.set(buffer, upto, val);
+    upto += Integer.BYTES;
+  }
+
+  void updateLong(long val) {
+    if (upto + Long.BYTES > buffer.length) flush();
+    BitUtil.VH_LE_LONG.set(buffer, upto, val);
+    upto += Long.BYTES;
+  }
+
+  void updateLongs(long[] vals, int offset, int len) {
+    flush();

Review Comment:
   I pushed the modification to limit the flush. Not only is the buffer no 
longer systematically flushed on entering `updateLongs(long[], int, int)`, but 
the last chunk is also not flushed, which is beneficial in the likely case that 
there is remaining data (the codec footer) after the `long[]` that could fit in 
the buffer.
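As a plain-Java illustration of this buffering strategy (a sketch, not the actual `BufferedChecksum` code; it uses `CRC32` and a `ByteBuffer` stand-in rather than Lucene's internals), the following only flushes full chunks and deliberately leaves a trailing partial chunk in the buffer so following data can share it:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

public class BufferedLongChecksum {
  private final CRC32 digest = new CRC32();
  private final ByteBuffer buffer = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);

  // Buffer the long; only push bytes to the digest when the buffer is full.
  void updateLong(long v) {
    if (buffer.remaining() < Long.BYTES) {
      flush();
    }
    buffer.putLong(v);
  }

  void updateLongs(long[] vals, int offset, int len) {
    for (int i = 0; i < len; i++) {
      updateLong(vals[offset + i]);
    }
    // Note: no flush here -- the last partial chunk stays buffered so that
    // subsequent data (e.g. a codec footer) can fill the same buffer.
  }

  private void flush() {
    digest.update(buffer.array(), 0, buffer.position());
    buffer.clear();
  }

  long getValue() {
    flush(); // drain whatever is still buffered before reporting the checksum
    return digest.getValue();
  }

  public static void main(String[] args) {
    BufferedLongChecksum buffered = new BufferedLongChecksum();
    buffered.updateLongs(new long[] {1L, 2L, 3L}, 0, 3);

    // Same bytes fed directly to CRC32, for comparison.
    CRC32 direct = new CRC32();
    ByteBuffer bb = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
    bb.putLong(1L).putLong(2L).putLong(3L);
    direct.update(bb.array(), 0, 24);

    System.out.println(buffered.getValue() == direct.getValue()); // true
  }
}
```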






Re: [PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]

2024-11-13 Thread via GitHub


shatejas commented on PR #13985:
URL: https://github.com/apache/lucene/pull/13985#issuecomment-2474147590

   > I'm curious how much this actually helps, and I know that you said that 
benchmark results would be posted.
   
   @ChrisHegarty Preliminary results showed approximately a 40 min (~13%) 
reduction in total force merge time for the 10M dataset. You should be able to 
find details here: 
https://github.com/opensearch-project/k-NN/issues/2134#issuecomment-2420793541. 
Though the setup does not use Lucene HNSW, it does use 
`Lucene99FlatVectorsReader`.
   
   > What is unclear is how much the sequential advise here helps over 
something like a load (similar to MemorySegment::load), that touches every page.
   
   I did try preload, which does `MemorySegment::load`, and there was an 
improvement in force merge time. The advantage I see with `updateReadAdvice` is 
being able to delay the loading of vectors until needed.





Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-13 Thread via GitHub


msokolov commented on PR #13984:
URL: https://github.com/apache/lucene/pull/13984#issuecomment-2474129035

   
   > Doesn't (neighbors are in order) and (neighbors are not repeated) 
verify uniqueness?
   
   duh, yes :)
   
   > Also, why wouldn't we assert neighbors are on this level for level 0?
   
   yes, we should, it's just that for level 0 this is the same as node in [0, 
size), whereas other levels are sparse: they have the same max = size - 1, but 
not every node is included





Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-13 Thread via GitHub


msfroh commented on code in PR #13987:
URL: https://github.com/apache/lucene/pull/13987#discussion_r1840841444


##
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java:
##
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
     ir.close();
     directory.close();
   }
+
+  private static class CountingScorable extends FilterScorable {
+    int count = 0;
+
+    public CountingScorable(Scorable in) {
+      super(in);
+    }
+
+    @Override
+    public float score() throws IOException {
+      count++;
+      return in.score();
+    }
+  }
+
+  public void testRepeatedCollectReusesScore() throws Exception {
+    Scorer s = new SimpleScorer();
+    CountingScorable countingScorable = new CountingScorable(s);
+    ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);
+    LeafCollector lc = scc.getLeafCollector(null);
+    lc.setScorer(countingScorable);
+
+    // We need to iterate on the scorer so that its doc() advances.
+    int doc;
+    while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      lc.collect(doc);
+      lc.collect(doc);
+    }
+
+    for (int i = 0; i < scores.length; i++) {
+      assertEquals(scores[i], scc.mscores[i * 2], 0f);
+      assertEquals(scores[i], scc.mscores[i * 2 + 1], 0f);
+    }
+
+    assertEquals(scores.length, countingScorable.count);

Review Comment:
   This is similar to the other test method (`testGetScores`), but we assert 
that each score is collected twice (since we collect each document twice) while 
the `score()` method is only called once per document.
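For readers following along, here is a minimal, self-contained illustration of the caching behavior this test exercises. The interfaces are simplified stand-ins, not Lucene's `Scorable`/`LeafCollector` API, and the `lastDoc` update marked in the comment appears to be the behavior this PR fixes:

```java
public class ScoreCacheSketch {
  interface Scorable {
    float score();
  }

  // Counts how many times the underlying score() is actually computed.
  static class CountingScorable implements Scorable {
    int count = 0;

    public float score() {
      count++;
      return 42f;
    }
  }

  // Caches the score per doc: repeated score() calls on the same doc
  // reuse the cached value instead of recomputing.
  static class ScoreCachingWrapper implements Scorable {
    private final Scorable in;
    private int lastDoc = -1;
    private int currentDoc = -1;
    private float cached;

    ScoreCachingWrapper(Scorable in) {
      this.in = in;
    }

    void setDoc(int doc) {
      currentDoc = doc;
    }

    public float score() {
      if (currentDoc != lastDoc) {
        cached = in.score();
        lastDoc = currentDoc; // without this update, the cache never takes effect
      }
      return cached;
    }
  }

  public static void main(String[] args) {
    CountingScorable inner = new CountingScorable();
    ScoreCachingWrapper wrapper = new ScoreCachingWrapper(inner);
    for (int doc = 0; doc < 3; doc++) {
      wrapper.setDoc(doc);
      wrapper.score();
      wrapper.score(); // second call on the same doc hits the cache
    }
    System.out.println(inner.count); // 3: one real score() per doc
  }
}
```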









Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-13 Thread via GitHub


jpountz commented on code in PR #13987:
URL: https://github.com/apache/lucene/pull/13987#discussion_r1840904596


##
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java:
##
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
     ir.close();
     directory.close();
   }
+
+  private static class CountingScorable extends FilterScorable {
+    int count = 0;
+
+    public CountingScorable(Scorable in) {
+      super(in);
+    }
+
+    @Override
+    public float score() throws IOException {
+      count++;
+      return in.score();
+    }
+  }
+
+  public void testRepeatedCollectReusesScore() throws Exception {
+    Scorer s = new SimpleScorer();
+    CountingScorable countingScorable = new CountingScorable(s);
+    ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);
+    LeafCollector lc = scc.getLeafCollector(null);
+    lc.setScorer(countingScorable);
+
+    // We need to iterate on the scorer so that its doc() advances.
+    int doc;
+    while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      lc.collect(doc);
+      lc.collect(doc);

Review Comment:
   It's illegal to call collect() multiple times on the same doc; this 
shouldn't be necessary, as `collect()` already calls `Scorable#score()` 
multiple times?






Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-13 Thread via GitHub


jpountz commented on code in PR #13990:
URL: https://github.com/apache/lucene/pull/13990#discussion_r1840915954


##
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:
##
@@ -154,7 +154,14 @@ protected TopDocs approximateSearch(
 
   @Override
   public String toString(String field) {
-    return getClass().getSimpleName() + ":" + this.field + "[" + query[0] + ",...][" + k + "]";

Review Comment:
   It's not obvious to me; `parentsFilter` is more a property of the index than 
of the query. You could have multiple levels of parents, which would make this 
property relevant to print, but at the same time I don't think they have proper 
toString() implementations. Let's leave it out for now.






Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-13 Thread via GitHub


jpountz commented on PR #13990:
URL: https://github.com/apache/lucene/pull/13990#issuecomment-2474335025

   > Could you update the byte knn query & DiversifyingChildren* knn queries as 
well?
   
   Unrelated: I noticed that the `DiversifyingChildren*` queries don't use the 
filter to evaluate the query, is this a bug?





Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-13 Thread via GitHub


mikemccand commented on code in PR #13987:
URL: https://github.com/apache/lucene/pull/13987#discussion_r1840911855


##
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java:
##
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
     ir.close();
     directory.close();
   }
+
+  private static class CountingScorable extends FilterScorable {
+    int count = 0;
+
+    public CountingScorable(Scorable in) {
+      super(in);
+    }
+
+    @Override
+    public float score() throws IOException {
+      count++;
+      return in.score();
+    }
+  }
+
+  public void testRepeatedCollectReusesScore() throws Exception {
+    Scorer s = new SimpleScorer();
+    CountingScorable countingScorable = new CountingScorable(s);
+    ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);
+    LeafCollector lc = scc.getLeafCollector(null);
+    lc.setScorer(countingScorable);
+
+    // We need to iterate on the scorer so that its doc() advances.
+    int doc;
+    while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      lc.collect(doc);
+      lc.collect(doc);

Review Comment:
   Hmm, why are we collecting the same `doc` twice? (I thought this was smelly 
-- we fixed the other two tests to stop doing that?)
   
   Oh I see, it's so `score()` happens more than once while on the same doc? 
Actually, since `ScoreCachingCollector`'s `LeafCollector` is calling `.score()` 
three times for each `.collect()`, you should be able to do only one 
`lc.collect(doc)` here?
   
   Does this new test case fail if you revert the bug fix?



##
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java:
##
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
     ir.close();
     directory.close();
   }
+
+  private static class CountingScorable extends FilterScorable {
+    int count = 0;
+
+    public CountingScorable(Scorable in) {
+      super(in);
+    }
+
+    @Override
+    public float score() throws IOException {
+      count++;
+      return in.score();
+    }
+  }
+
+  public void testRepeatedCollectReusesScore() throws Exception {
+    Scorer s = new SimpleScorer();
+    CountingScorable countingScorable = new CountingScorable(s);
+    ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);
+    LeafCollector lc = scc.getLeafCollector(null);
+    lc.setScorer(countingScorable);
+
+    // We need to iterate on the scorer so that its doc() advances.
+    int doc;
+    while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      lc.collect(doc);
+      lc.collect(doc);
+    }
+
+    for (int i = 0; i < scores.length; i++) {
+      assertEquals(scores[i], scc.mscores[i * 2], 0f);
+      assertEquals(scores[i], scc.mscores[i * 2 + 1], 0f);
+    }
+
+    assertEquals(scores.length, countingScorable.count);

Review Comment:
   Ahh I see -- you want to assert that indeed `score()` was called (up on top) 
multiple times per hit, but then down below in the "true" scorer, it was 
dedup'd to just once per hit.  So that's why you `.collect(doc)` twice, OK ...



##
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java:
##
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
     ir.close();
     directory.close();
   }
+
+  private static class CountingScorable extends FilterScorable {
+    int count = 0;

Review Comment:
   We don't need the `= 0` since it's java's default already?



##
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java:
##
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
     ir.close();
     directory.close();
   }
+
+  private static class CountingScorable extends FilterScorable {
+    int count = 0;
+
+    public CountingScorable(Scorable in) {
+      super(in);
+    }
+
+    @Override
+    public float score() throws IOException {
+      count++;
+      return in.score();
+    }
+  }
+
+  public void testRepeatedCollectReusesScore() throws Exception {
+    Scorer s = new SimpleScorer();
+    CountingScorable countingScorable = new CountingScorable(s);
+    ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);

Review Comment:
   Maybe add a comment about why we need `* 2` here?






Re: [PR] Multireader Support in Searcher Manager [lucene]

2024-11-13 Thread via GitHub


jpountz commented on PR #13976:
URL: https://github.com/apache/lucene/pull/13976#issuecomment-2472800472

   To add to @vigyasharma's point, I have been wondering if we should remove 
`SearcherManager` and encourage users to use `IndexReaderManager`. 
`IndexSearcher` is cheap to create, and there are some reasons why it may be 
interesting to create a different `IndexSearcher` per query already, e.g. to 
tune parallelism based on current load, or to configure a timeout on the 
`IndexSearcher`.
   
   Maybe `MultiReader` falls in the same bucket. Presumably, the `MultiReader` 
is made from multiple `DirectoryReader`s, so maybe the application code should 
not create a `SearcherManager` that works with a `MultiReader`, but instead 
manage two (or more, one per `Directory`) `ReaderManager`s and dynamically 
create a `MultiReader`, and then an `IndexSearcher` on top of this 
`MultiReader`, on every search?





[PR] lucene-monitor: make abstract DocumentBatch public [lucene]

2024-11-13 Thread via GitHub


cpoerschke opened a new pull request, #13993:
URL: https://github.com/apache/lucene/pull/13993

   ### Description
   
   The static `DocumentBatch.of` methods are already public; if the class itself 
were public too, that would allow applications -- e.g. see @kotman12's 
https://github.com/apache/solr/pull/2382 -- to refer to the class, e.g. in a 
visitor.
   





Re: [PR] Allow easier verification of the Panama Vectorization provider with newer Java versions [lucene]

2024-11-13 Thread via GitHub


ChrisHegarty commented on PR #13986:
URL: https://github.com/apache/lucene/pull/13986#issuecomment-2473764740

   > > Personally I would prefer a less if/else/default handling using Optional 
like done in the previous sysprops.
   > 
   > I'll make that change before merging.
   
   Well... I tried, and it was awful! The reason being the need to parseInt the 
value and catch the exception, which, while possible, just led to more 
convoluted code. Easy to refactor later, if at all.





Re: [PR] Add some basic HNSW graph checks to CheckIndex [lucene]

2024-11-13 Thread via GitHub


benchaplin commented on code in PR #13984:
URL: https://github.com/apache/lucene/pull/13984#discussion_r1840492668


##
lucene/core/src/java/org/apache/lucene/index/CheckIndex.java:
##
@@ -2746,6 +2781,106 @@ public static Status.VectorValuesStatus testVectors(
 return status;
   }
 
+  /** Test the HNSW graph. */
+  public static Status.HnswGraphsStatus testHnswGraphs(
+  CodecReader reader, PrintStream infoStream, boolean failFast) throws 
IOException {
+if (infoStream != null) {
+  infoStream.print("test: hnsw graphs.");
+}
+long startNS = System.nanoTime();
+Status.HnswGraphsStatus status = new Status.HnswGraphsStatus();
+KnnVectorsReader vectorsReader = reader.getVectorReader();
+FieldInfos fieldInfos = reader.getFieldInfos();
+
+try {
+  if (fieldInfos.hasVectorValues()) {
+for (FieldInfo fieldInfo : fieldInfos) {
+  if (fieldInfo.hasVectorValues()) {
+if (vectorsReader instanceof 
PerFieldKnnVectorsFormat.FieldsReader) {
+  KnnVectorsReader fieldReader =
+  ((PerFieldKnnVectorsFormat.FieldsReader) vectorsReader)
+  .getFieldReader(fieldInfo.name);
+  if (fieldReader instanceof HnswGraphProvider) {
+HnswGraph hnswGraph = ((HnswGraphProvider) 
fieldReader).getGraph(fieldInfo.name);
+testHnswGraph(hnswGraph, fieldInfo.name, status);
+  }
+}
+  }
+}
+  }
+  StringJoiner hnswGraphResultJoiner = new StringJoiner(", ");
+  for (Map.Entry hnswGraphStatus :
+  status.hnswGraphsStatusByField.entrySet()) {
+hnswGraphResultJoiner.add(
+String.format(
+Locale.ROOT,
+"(field name: %s, levels: %d, total nodes: %d)",
+hnswGraphStatus.getKey(),
+hnswGraphStatus.getValue().numLevels,
+hnswGraphStatus.getValue().totalNumNodes));
+  }
+  msg(
+  infoStream,
+  String.format(
+  Locale.ROOT,
+  "OK [%d fields%s] [took %.3f sec]",
+  status.hnswGraphsStatusByField.size(),
+  hnswGraphResultJoiner.toString().isEmpty() ? "" : ": " + 
hnswGraphResultJoiner,
+  nsToSec(System.nanoTime() - startNS)));
+} catch (Throwable e) {
+  if (failFast) {
+throw IOUtils.rethrowAlways(e);
+  }
+  msg(infoStream, "ERROR: " + e);
+  status.error = e;
+  if (infoStream != null) {
+e.printStackTrace(infoStream);
+  }
+}
+
+return status;
+  }
+
+  private static void testHnswGraph(
+  HnswGraph hnswGraph, String fieldName, Status.HnswGraphsStatus status)
+  throws IOException, CheckIndexException {
+if (hnswGraph != null) {
+  status.hnswGraphsStatusByField.put(fieldName, new Status.HnswGraphStatus());
+  final int numLevels = hnswGraph.numLevels();
+  // Perform tests on each level of the HNSW graph
+  for (int level = numLevels - 1; level >= 0; level--) {
+HnswGraph.NodesIterator nodesIterator = hnswGraph.getNodesOnLevel(level);
+while (nodesIterator.hasNext()) {
+  int node = nodesIterator.nextInt();
+  hnswGraph.seek(level, node);
+  int nbr, lastNeighbor = -1, firstNeighbor = -1;
+  while ((nbr = hnswGraph.nextNeighbor()) != NO_MORE_DOCS) {
+if (firstNeighbor == -1) {

Review Comment:
   Sounds good, I'll add a TODO (and add those checks in a follow-up PR). A couple of questions first:
   - Doesn't (neighbors are in order) and (neighbors are not repeated) verify uniqueness?
   - Also, why wouldn't we assert neighbors are on this level for level 0?
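   The invariant in question can be sketched standalone (a hypothetical helper, not the PR's CheckIndex code): because neighbors must be strictly increasing, a single comparison against the previous neighbor catches both out-of-order and repeated entries in one pass.

   ```java
   // Sketch of the check under discussion: if each neighbor must be strictly
   // greater than the previous one, sortedness and uniqueness are one check.
   // Hypothetical standalone helper, not the PR's actual code.
   final class NeighborChecks {
       static boolean neighborsSortedAndUnique(int[] neighbors) {
           int lastNeighbor = -1;
           for (int nbr : neighbors) {
               if (nbr <= lastNeighbor) {
                   return false; // out of order, or a repeat of the previous neighbor
               }
               lastNeighbor = nbr;
           }
           return true;
       }
   }
   ```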



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-13 Thread via GitHub


msfroh commented on code in PR #13987:
URL: https://github.com/apache/lucene/pull/13987#discussion_r1841119813


##
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java:
##
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
 ir.close();
 directory.close();
   }
+
+  private static class CountingScorable extends FilterScorable {
+int count = 0;
+
+public CountingScorable(Scorable in) {
+  super(in);
+}
+
+@Override
+public float score() throws IOException {
+  count++;
+  return in.score();
+}
+  }
+
+  public void testRepeatedCollectReusesScore() throws Exception {
+Scorer s = new SimpleScorer();
+CountingScorable countingScorable = new CountingScorable(s);
+ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);
+LeafCollector lc = scc.getLeafCollector(null);
+lc.setScorer(countingScorable);
+
+// We need to iterate on the scorer so that its doc() advances.
+int doc;
+while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+  lc.collect(doc);
+  lc.collect(doc);

Review Comment:
   The new test does fail if I revert the fix. 
   
   As I mentioned in https://github.com/apache/lucene/pull/13987#issuecomment-2472044035, the condition for this being an actual bug is probably not really valid (i.e. you need to call `collect()` on the `ScoreCachingWrappingLeafCollector` multiple times for the same document).
   
   The caching *does* work if the wrapped `LeafCollector` calls `score()` multiple times, which can happen in `MultiCollector` (for example). The layering is roughly:
   
   ```
   ScoreCachingWrappingLeafCollector wraps
 Some other LeafCollector whose score is
   ScoreCachingWrappingScorer, which wraps
 The real Scorable
   ```
   
   The multiple calls to `score()` from `Some other LeafCollector` work just fine. Again, it's the call to `collect()` on `ScoreCachingWrappingLeafCollector` that was invalidating the cache.
   
   Those other two tests broke because my "fix" causes the caching to work (but they didn't expect it).
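   The caching pattern being discussed can be sketched in isolation (hypothetical minimal classes, not Lucene's actual `ScoreCachingWrappingScorer`): the delegate is consulted at most once per document, provided the cached doc is recorded with `lastDoc = currDoc` rather than the reverted form.

   ```java
   // Hypothetical miniature of the score-caching pattern under discussion.
   final class CachingScorer {
       /** Minimal stand-in for Lucene's Scorable (hypothetical). */
       interface Scorable {
           float score();
       }

       private final Scorable in;
       private int currDoc = -1;  // doc the collector is currently positioned on
       private int lastDoc = -1;  // doc whose score is currently cached
       private float cachedScore;

       CachingScorer(Scorable in) {
           this.in = in;
       }

       void setDocument(int doc) {
           currDoc = doc;
       }

       float score() {
           if (currDoc != lastDoc) {
               cachedScore = in.score();
               lastDoc = currDoc; // the fix: record which doc we cached.
               // The reverted form, `currDoc = lastDoc`, re-scores whenever
               // the collector is repositioned on the same doc again.
           }
           return cachedScore;
       }

       /** Demo: how many times the delegate is invoked for the given doc sequence. */
       static int delegateCalls(int[] docs) {
           final int[] count = {0};
           CachingScorer s = new CachingScorer(() -> {
               count[0]++;
               return 1.0f;
           });
           for (int d : docs) {
               s.setDocument(d);
               s.score();
               s.score(); // second call on the same doc must hit the cache
           }
           return count[0];
       }
   }
   ```

   With the correct assignment, repeating a doc in the sequence adds no extra delegate calls.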






Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-13 Thread via GitHub


benwtrent commented on code in PR #13990:
URL: https://github.com/apache/lucene/pull/13990#discussion_r1841121411


##
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:
##
@@ -154,7 +154,14 @@ protected TopDocs approximateSearch(
 
   @Override
   public String toString(String field) {
-return getClass().getSimpleName() + ":" + this.field + "[" + query[0] + ",...][" + k + "]";

Review Comment:
   I agree with @jpountz; as it is right now, I am not convinced that printing the parentsFilter is useful.






Re: [PR] Adding filter to the toString() method of KnnFloatVectorQuery [lucene]

2024-11-13 Thread via GitHub


benwtrent commented on PR #13990:
URL: https://github.com/apache/lucene/pull/13990#issuecomment-2474732022

   > Unrelated: I noticed that the DiversifyingChildren* queries don't use the filter to evaluate the query, is this a bug?
   
   I am unsure what you mean by this? The filter is utilized when building the accepted docs set in the super() classes.
   





Re: [PR] Update lastDoc in ScoreCachingWrappingScorer [lucene]

2024-11-13 Thread via GitHub


msfroh commented on code in PR #13987:
URL: https://github.com/apache/lucene/pull/13987#discussion_r1841125795


##
lucene/core/src/test/org/apache/lucene/search/TestScoreCachingWrappingScorer.java:
##
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
 ir.close();
 directory.close();
   }
+
+  private static class CountingScorable extends FilterScorable {
+int count = 0;
+
+public CountingScorable(Scorable in) {
+  super(in);
+}
+
+@Override
+public float score() throws IOException {
+  count++;
+  return in.score();
+}
+  }
+
+  public void testRepeatedCollectReusesScore() throws Exception {
+Scorer s = new SimpleScorer();
+CountingScorable countingScorable = new CountingScorable(s);
+ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);
+LeafCollector lc = scc.getLeafCollector(null);
+lc.setScorer(countingScorable);
+
+// We need to iterate on the scorer so that its doc() advances.
+int doc;
+while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+  lc.collect(doc);
+  lc.collect(doc);

Review Comment:
   Yeah -- that's why I'd hesitate to really call this a "bug fix".
   
   I'm pretty sure you meant to write `lastDoc = currDoc` in https://github.com/apache/lucene/pull/12407, but functionally `currDoc = lastDoc` works fine as long as you don't call `collect()` multiple times on the same doc.
   
   I only noticed this because I was [copy/pasting the logic](https://github.com/opensearch-project/OpenSearch/pull/16366/files#diff-45fac6161bed1775163b916fba9ea296fbe9a8adf931a48593672c5c558a114b) to adapt it for `LeafBucketCollector` in OpenSearch and IntelliJ warned me that `lastDoc` could be `final`.






Re: [PR] Parse escaped brackets and spaces in range queries [lucene]

2024-11-13 Thread via GitHub


github-actions[bot] commented on PR #13887:
URL: https://github.com/apache/lucene/pull/13887#issuecomment-2475092304

   This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!





Re: [PR] Add a Multi-Vector Similarity Function [lucene]

2024-11-13 Thread via GitHub


vigyasharma commented on code in PR #13991:
URL: https://github.com/apache/lucene/pull/13991#discussion_r1841544154


##
lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java:
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Computes similarity between two multi-vectors.
+ *
+ * <p>A multi-vector is a collection of multiple vectors that represent a single document or query.
+ * MultiVectorSimilarityFunction is used to determine nearest neighbors during indexing and search
+ * on multi-vectors.
+ */
+public class MultiVectorSimilarityFunction {
+
+  /** Aggregation function to combine similarity across multiple vector values */
+  public enum Aggregation {
+
+/**
+ * Sum_Max Similarity between two multi-vectors. Computes the sum of maximum similarity found
+ * for each vector in the first multi-vector against all vectors in the second multi-vector.
+ */
+SUM_MAX {
+  @Override
+  public float aggregate(
+  float[] outer,
+  float[] inner,
+  VectorSimilarityFunction vectorSimilarityFunction,
+  int dimension) {
+if (outer.length % dimension != 0 || inner.length % dimension != 0) {
+  throw new IllegalArgumentException("Multi vectors do not match provided dimension value");
+}
+
+// TODO: can we avoid making vector copies?
+List<float[]> outerList = new ArrayList<>();
+List<float[]> innerList = new ArrayList<>();
+for (int i = 0; i < outer.length; i += dimension) {
+  outerList.add(ArrayUtil.copyOfSubArray(outer, i, i + dimension));
+}
+for (int i = 0; i < inner.length; i += dimension) {
+  innerList.add(ArrayUtil.copyOfSubArray(inner, i, i + dimension));
+}
+
+float result = 0f;
+for (float[] o : outerList) {
+  float maxSim = Float.MIN_VALUE;
+  for (float[] i : innerList) {
+maxSim = Float.max(maxSim, vectorSimilarityFunction.compare(o, i));

Review Comment:
   Right, but it needs to go all the way down to `VectorUtilSupport`, which I 
think should be a PR of its own.
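   For reference, the SUM_MAX aggregation quoted in this hunk can be reproduced in a standalone sketch (hypothetical class names; a plain dot product stands in for `VectorSimilarityFunction.compare`, and `Float.NEGATIVE_INFINITY` replaces the diff's `Float.MIN_VALUE`, which is the smallest *positive* float):

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   // Standalone sketch of the SUM_MAX aggregation, mirroring the per-vector
   // copies made in the quoted code. Hypothetical, not the PR's class.
   final class SumMaxSketch {

       static float dot(float[] a, float[] b) {
           float sum = 0f;
           for (int i = 0; i < a.length; i++) {
               sum += a[i] * b[i];
           }
           return sum;
       }

       static float sumMax(float[] outer, float[] inner, int dimension) {
           if (outer.length % dimension != 0 || inner.length % dimension != 0) {
               throw new IllegalArgumentException("Multi vectors do not match provided dimension value");
           }
           // Split each flattened multi-vector into dimension-sized sub-vectors.
           List<float[]> outerList = new ArrayList<>();
           List<float[]> innerList = new ArrayList<>();
           for (int i = 0; i < outer.length; i += dimension) {
               outerList.add(Arrays.copyOfRange(outer, i, i + dimension));
           }
           for (int i = 0; i < inner.length; i += dimension) {
               innerList.add(Arrays.copyOfRange(inner, i, i + dimension));
           }
           float result = 0f;
           for (float[] o : outerList) {
               // NEGATIVE_INFINITY is used here so that similarity functions
               // returning negative values still pick a correct maximum.
               float maxSim = Float.NEGATIVE_INFINITY;
               for (float[] in : innerList) {
                   maxSim = Math.max(maxSim, dot(in, o));
               }
               result += maxSim;
           }
           return result;
       }
   }
   ```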






Re: [PR] Add a Multi-Vector Similarity Function [lucene]

2024-11-13 Thread via GitHub


dungba88 commented on code in PR #13991:
URL: https://github.com/apache/lucene/pull/13991#discussion_r1841484667


##
lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java:
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Computes similarity between two multi-vectors.
+ *
+ * <p>A multi-vector is a collection of multiple vectors that represent a single document or query.
+ * MultiVectorSimilarityFunction is used to determine nearest neighbors during indexing and search
+ * on multi-vectors.
+ */
+public class MultiVectorSimilarityFunction {
+
+  /** Aggregation function to combine similarity across multiple vector values */
+  public enum Aggregation {
+
+/**
+ * Sum_Max Similarity between two multi-vectors. Computes the sum of maximum similarity found
+ * for each vector in the first multi-vector against all vectors in the second multi-vector.
+ */
+SUM_MAX {
+  @Override
+  public float aggregate(
+  float[] outer,
+  float[] inner,
+  VectorSimilarityFunction vectorSimilarityFunction,
+  int dimension) {
+if (outer.length % dimension != 0 || inner.length % dimension != 0) {
+  throw new IllegalArgumentException("Multi vectors do not match provided dimension value");
+}
+
+// TODO: can we avoid making vector copies?
+List<float[]> outerList = new ArrayList<>();
+List<float[]> innerList = new ArrayList<>();
+for (int i = 0; i < outer.length; i += dimension) {
+  outerList.add(ArrayUtil.copyOfSubArray(outer, i, i + dimension));
+}
+for (int i = 0; i < inner.length; i += dimension) {
+  innerList.add(ArrayUtil.copyOfSubArray(inner, i, i + dimension));
+}
+
+float result = 0f;
+for (float[] o : outerList) {
+  float maxSim = Float.MIN_VALUE;
+  for (float[] i : innerList) {
+maxSim = Float.max(maxSim, vectorSimilarityFunction.compare(o, i));

Review Comment:
   If we add another `compare` method with start and end indexes for both inner and outer, I guess we won't need to copy the array?
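   A sketch of the kind of overload being suggested (a hypothetical signature; Lucene's `VectorSimilarityFunction` does not currently expose it), shown for dot product only: comparing sub-ranges in place removes the need for per-vector copies.

   ```java
   // Hypothetical offset-based comparison. With this shape, SUM_MAX could walk
   // the flattened multi-vectors directly instead of materializing a float[]
   // copy per sub-vector.
   final class SubrangeCompare {
       static float dotProduct(float[] a, int aFrom, float[] b, int bFrom, int dimension) {
           float sum = 0f;
           for (int i = 0; i < dimension; i++) {
               sum += a[aFrom + i] * b[bFrom + i];
           }
           return sum;
       }
   }
   ```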



##
lucene/core/src/java/org/apache/lucene/index/MultiVectorSimilarityFunction.java:
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Computes similarity between two multi-vectors.
+ *
+ * <p>A multi-vector is a collection of multiple vectors that represent a single document or query.
+ * MultiVectorSimilarityFunction is used to determine nearest neighbors during indexing and search
+ * on multi-vectors.
+ */
+public class MultiVectorSimilarityFunction {
+
+  /** Aggregation function to combine similarity across multiple vector values */
+  public enum Aggregation {
+
+/**
+ * Sum_Max Similarity between two multi-vectors. Computes the sum of maximum similarity found
+ * for each vector in the first multi-vector against all vectors in the second multi-vector.
+ */
+SUM_MAX {
+  @Override
+  public float aggregate(
+  float[] outer,
+  float[] inner,
+  VectorSimilarityFunction vectorSimilarityFunction,

Re: [PR] [KNN] Add comment and remove duplicate code [lucene]

2024-11-13 Thread via GitHub


dungba88 commented on PR #13594:
URL: https://github.com/apache/lucene/pull/13594#issuecomment-2475301572

   Thanks Kaival for reviewing and approving.
   
   Could someone from the Lucene committers help review and merge this PR if it looks good?





Re: [PR] Break the loop when segment is fully deleted by prior delTerms or delQueries [lucene]

2024-11-13 Thread via GitHub


vsop-479 commented on PR #13398:
URL: https://github.com/apache/lucene/pull/13398#issuecomment-2475304269

   Hi @mikemccand,
   I think I addressed all your comments. Please take a look when you get a chance.





Re: [PR] [DRAFT] Change vector input from IndexInput to RandomAccessInput [lucene]

2024-11-13 Thread via GitHub


dungba88 commented on code in PR #13981:
URL: https://github.com/apache/lucene/pull/13981#discussion_r1841492890


##
lucene/core/src/java/org/apache/lucene/store/RandomAccessInput.java:
##
@@ -77,4 +85,6 @@ default void readBytes(long pos, byte[] bytes, int offset, int length) throws IO
* @see IndexInput#prefetch
*/
   default void prefetch(long offset, long length) throws IOException {}
+
+  Object clone();

Review Comment:
   This baffles me a lot! I can't let the return type be `RandomAccessInput` because it will conflict with `DataInput.clone()`, which returns `DataInput`.
   
   Adding the clone method seems a bit weird to me, but I'm not sure if there is a better way to go around it.
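   The clash can be reproduced in miniature (hypothetical types, not the Lucene classes); the compiling variant below declares `Object clone()` on the interface, and the comments note why a covariant return type would be rejected.

   ```java
   // Miniature of the return-type clash described above (hypothetical names).
   // If Reader declared `Reader clone()`, ConcreteInput below would not
   // compile: the clone() it inherits from BaseInput returns BaseInput, which
   // is not a Reader. Declaring `Object clone()` on the interface is the
   // least-common-denominator signature both hierarchies can satisfy, at the
   // cost of a cast at call sites.
   abstract class BaseInput {
       @Override
       public BaseInput clone() {
           return this; // simplified; a real clone would copy state
       }
   }

   interface Reader {
       Object clone(); // cannot be `Reader clone()` without breaking ConcreteInput
   }

   final class ConcreteInput extends BaseInput implements Reader {
       // Inherits BaseInput.clone(); its BaseInput return type satisfies
       // Reader's `Object clone()` covariantly.
   }
   ```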






Re: [PR] [DRAFT] Change vector input from IndexInput to RandomAccessInput [lucene]

2024-11-13 Thread via GitHub


dungba88 commented on code in PR #13981:
URL: https://github.com/apache/lucene/pull/13981#discussion_r1841491566


##
lucene/core/src/java21/org/apache/lucene/internal/vectorization/Lucene99MemorySegmentByteVectorScorer.java:
##
@@ -40,8 +41,14 @@ abstract sealed class Lucene99MemorySegmentByteVectorScorer
* returned.
*/
   public static Optional<Lucene99MemorySegmentByteVectorScorer> create(
-  VectorSimilarityFunction type, IndexInput input, KnnVectorValues values, byte[] queryVector) {
+  VectorSimilarityFunction type,
+  RandomAccessInput slice,
+  KnnVectorValues values,
+  byte[] queryVector) {
 assert values instanceof ByteVectorValues;
+if (!(slice instanceof IndexInput input)) {

Review Comment:
   It's used in the next block though
   
   ```
   input = FilterIndexInput.unwrapOnlyTest(input);
   ```






Re: [PR] Use Arrays.mismatch in FSTCompiler#add. [lucene]

2024-11-13 Thread via GitHub


vsop-479 commented on PR #13924:
URL: https://github.com/apache/lucene/pull/13924#issuecomment-2475534092

   Thanks for your review, @dungba88.





Re: [I] Move vector search from IndexInput to RandomAccessInput [lucene]

2024-11-13 Thread via GitHub


dungba88 commented on issue #13938:
URL: https://github.com/apache/lucene/issues/13938#issuecomment-2475621958

   @jpountz it's only a draft (I need to add tests), but can you give some feedback on https://github.com/apache/lucene/pull/13981. I'm not sure if I have fully captured the intention of this change.

