[PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]

2024-11-09 Thread via GitHub


shatejas opened a new pull request, #13985:
URL: https://github.com/apache/lucene/pull/13985

   reading from IndexInput
   
   The change is needed to be able to reduce the force merge time. 
Lucene99FlatVectorsReader is opened with IOContext.RANDOM, this optimizes 
searches with madvise as RANDOM. For merges we need sequential access and 
ability to preload pages to be able to shorten the merge time.
   
   The change updates the ReadAdvice.SEQUENTIAL before the merge starts and 
reverts it to ReadAdvice.RANDOM at the end of the merge for
   Lucene99FlatVectorsReader.
   
   ### Description
   
   Benchmarking results coming up. Opening as a draft to get initial feedback
   
   ### Related issues
   https://github.com/apache/lucene/issues/13920
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]

2024-11-09 Thread via GitHub


shatejas commented on PR #13985:
URL: https://github.com/apache/lucene/pull/13985#issuecomment-2466151296

   cc: @uschindler @jpountz @navneet1v 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2024-11-09 Thread via GitHub


github-actions[bot] commented on PR #13914:
URL: https://github.com/apache/lucene/pull/13914#issuecomment-2466522248

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2024-11-09 Thread via GitHub


github-actions[bot] commented on PR #13521:
URL: https://github.com/apache/lucene/pull/13521#issuecomment-2466522392

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-11-09 Thread via GitHub


krickert commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2466210357

   > My current thinking is that this is a rapidly evolving field, and it's 
early to lean one way or another. Adding this support unlocks experimentation.
   
   Amen! 
   
   This ends up being so domain specific.  Multi-embeddings become key when you 
deal with domain voids in the LLMs used to create the embeddings.  That's most 
big corpuses.   So at least being able to experiment would get you far more 
feedback.
   
   I would be ok with writing some tests if that helps.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Introduces IndexInput#updateReadAdvice to change the readadvice while [lucene]

2024-11-09 Thread via GitHub


jpountz commented on code in PR #13985:
URL: https://github.com/apache/lucene/pull/13985#discussion_r1835530675


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsReader.java:
##
@@ -170,6 +170,15 @@ public void checkIntegrity() throws IOException {
 CodecUtil.checksumEntireFile(vectorData);
   }
 
+  @Override
+  public FlatVectorsReader getMergeInstance() {
+try {
+  return new MergeLucene99FlatVectorsReader(this);
+} catch (IOException exception) {
+  throw new RuntimeException(exception);

Review Comment:
   ```suggestion
 throw new UncheckedIOException(exception);
   ```



##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java:
##
@@ -113,6 +114,25 @@ public Lucene99HnswVectorsReader(SegmentReadState state, 
FlatVectorsReader flatV
 }
   }
 
+  private Lucene99HnswVectorsReader(
+  Lucene99HnswVectorsReader reader, KnnVectorsReader flatVectorsReader) {
+assert flatVectorsReader instanceof FlatVectorsReader;

Review Comment:
   I would rather make this ctor take a `FlatVectorsReader` and push the 
responsibility to callers to make the cast (or maybe we don't even need a cast 
if we make getMergeInstance() return a `FlatVectorsReader`?)



##
lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java:
##
@@ -1554,6 +1555,42 @@ public void testPrefetchOnSlice() throws IOException {
 doTestPrefetch(TestUtil.nextInt(random(), 1, 1024));
   }
 
+  public void testUpdateReadAdvice() throws IOException {
+try (Directory dir = getDirectory(createTempDir("testUpdateReadAdvice"))) {
+  final int totalLength = TestUtil.nextInt(random(), 16384, 65536);
+  byte[] arr = new byte[totalLength];
+  random().nextBytes(arr);
+  try (IndexOutput out = dir.createOutput("temp.bin", IOContext.DEFAULT)) {
+out.writeBytes(arr, arr.length);
+  }
+
+  try (IndexInput orig = dir.openInput("temp.bin", IOContext.DEFAULT)) {
+IndexInput in = random().nextBoolean() ? orig.clone() : orig;
+// Read advice updated at start
+orig.updateReadAdvice(randomReadAdvice());
+for (int i = 0; i < totalLength; i++) {
+  int offset = TestUtil.nextInt(random(), 0, (int) in.length() - 1);
+  in.seek(offset);
+  assertEquals(arr[offset], in.readByte());
+}
+
+// Updating readAdvice in the middle
+for (int i = 0; i < 10_000; ++i) {
+  int offset = TestUtil.nextInt(random(), 0, (int) in.length() - 1);
+  in.seek(offset);
+  assertEquals(arr[offset], in.readByte());
+  if (random().nextBoolean()) {
+orig.updateReadAdvice(randomReadAdvice());
+  }
+}
+  }
+}
+  }
+
+  private ReadAdvice randomReadAdvice() {
+return ReadAdvice.values()[TestUtil.nextInt(random(), 0, 
ReadAdvice.values().length - 1)];

Review Comment:
   You could use `RandomPicks#randomFrom`.



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -123,4 +123,11 @@ public abstract void search(
   public KnnVectorsReader getMergeInstance() {
 return this;
   }
+
+  /**
+   * Optional: reset or close merge resources used in the reader
+   *
+   * The default implementation is empty
+   */
+  public void finishMerge() throws IOException {}

Review Comment:
   Separately, it would be nice to improve the asserting framework 
(`AssertingKnnVectorsReader` in this case) to validate that `finishMerge()` is 
always called on merge instances.



##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##
@@ -123,4 +123,11 @@ public abstract void search(
   public KnnVectorsReader getMergeInstance() {
 return this;
   }
+
+  /**
+   * Optional: reset or close merge resources used in the reader
+   *
+   * The default implementation is empty
+   */
+  public void finishMerge() throws IOException {}

Review Comment:
   I wonder if we should reuse the close() method of merge instances for this, 
what do you think @uschindler ?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsReader.java:
##
@@ -327,4 +336,61 @@ static FieldEntry create(IndexInput input, FieldInfo info) 
throws IOException {
   info);
 }
   }
+
+  private static final class MergeLucene99FlatVectorsReader extends 
FlatVectorsReader {
+
+private final Lucene99FlatVectorsReader delegate;
+
+MergeLucene99FlatVectorsReader(final Lucene99FlatVectorsReader 
flatVectorsReader)

Review Comment:
   Nit: I'm not sure I would create a wrapper, maybe we should just add a copy 
constructor and have a `boolean merging` instance like we do e.g. on stored 
fields (see `Lucene90CompressingStoredFieldsReader`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above