Re: [PR] Allow reading binary doc values as a RandomAccessInput [lucene]

via GitHub Mon, 09 Dec 2024 07:44:17 -0800


jpountz commented on code in PR #13948:
URL: https://github.com/apache/lucene/pull/13948#discussion_r1876152549



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java:
##########
@@ -791,25 +792,26 @@ public BinaryDocValues getBinary(FieldInfo field) throws 
IOException {
       return DocValues.emptyBinary();
     }
 
-    final RandomAccessInput bytesSlice = 
data.randomAccessSlice(entry.dataOffset, entry.dataLength);
+    final RandomAccessInputRef bytesSlice =
+        new RandomAccessInputRef(data.randomAccessSlice(entry.dataOffset, 
entry.dataLength));

Review Comment:
   Nit, maybe extract two variables: `bytesSlice` and `byteSliceRef` to make it 
a bit easier to read?



##########
lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextDocValuesReader.java:
##########
@@ -383,17 +392,31 @@ public long cost() {
 
       @Override
       public int advance(int target) throws IOException {
-        return docsWithField.advance(target);
+        int doc = docsWithField.advance(target);
+        if (doc != NO_MORE_DOCS) {
+          consume(values.apply(doc));
+        }
+        return doc;
+      }
+
+      private void consume(BytesRef bytesRef) {
+        bytes.reset(bytesRef.bytes);
+        inputRef.offset = bytesRef.offset;
+        inputRef.length = bytesRef.length;
       }
 
       @Override
       public boolean advanceExact(int target) throws IOException {
-        return docsWithField.advanceExact(target);
+        if (docsWithField.advanceExact(target)) {
+          consume(values.apply(target));
+          return true;
+        }
+        return false;
       }
 
       @Override
-      public BytesRef binaryValue() throws IOException {
-        return values.apply(docsWithField.docID());
+      public RandomAccessInputRef randomAccessInputValue() {
+        return inputRef;

Review Comment:
   What was the problem with calling `values#apply` in 
`randomAccessInputValue()` rather than in 
`nextDoc()`/`advance()`/`advanceExact()`?



##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -627,35 +629,58 @@ public static String toHexString(String s) {
   }
 
   /**
-   * Interprets the given byte array as UTF-8 and converts to UTF-16. It is 
the responsibility of
-   * the caller to make sure that the destination array is large enough.
+   * Interprets the given {@link RandomAccessInput} slice as UTF-8 and 
converts to UTF-16. It is the
+   * responsibility of the caller to make sure that the destination array is 
large enough.
+   *
+   * <p>NOTE: Full characters are read, even if this reads past the length 
passed (and can result in
+   * an IOException if invalid UTF-8 is passed). Explicit checks for valid 
UTF-8 are not performed.
+   */
+  // TODO: broken if chars.offset != 0
+  public static int UTF8toUTF16(RandomAccessInput input, long offset, int 
length, char[] out)

Review Comment:
   Having `UnicodeUtil` need to work on a `RandomAccessInput` doesn't feel 
right. Do you know if this is performance-sensitive? Otherwise I'd rather make 
`RandomAccessInputRef` implement `utf8ToString()` by converting to a `BytesRef` 
first.



##########
lucene/core/src/java/org/apache/lucene/store/RandomAccessInputDataInput.java:
##########
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import org.apache.lucene.util.RandomAccessInputRef;
+
+/**
+ * DataInput backed by a {@link RandomAccessInput}. <b>WARNING:</b> This class 
omits all low-level
+ * checks.
+ *
+ * @lucene.experimental
+ */
+public final class RandomAccessInputDataInput extends DataInput {
+
+  private RandomAccessInput input;
+  private long offset;
+  private long length;
+
+  private long pos;

Review Comment:
   I wonderif `pos` should start at `offset` rather than `0` so that we don't 
have to sum up `offset` and `pos` whenever we need to know the current position?



##########
lucene/core/src/java/org/apache/lucene/index/BinaryDocValues.java:
##########
@@ -18,19 +18,20 @@
 package org.apache.lucene.index;
 
 import java.io.IOException;
-import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.RandomAccessInputRef;
 
-/** A per-document numeric value. */
+/** A per-document binary value. */

Review Comment:
   Woops, a 10 years old typo. :)



##########
lucene/core/src/java/org/apache/lucene/search/FieldComparator.java:
##########
@@ -234,7 +235,8 @@ public TermValComparator(int numHits, String field, boolean 
sortMissingLast) {
 
     private BytesRef getValueForDoc(int doc) throws IOException {
       if (docTerms.advanceExact(doc)) {
-        return docTerms.binaryValue();
+        bytesRefBuilder.copyBytes(docTerms.randomAccessInputValue());
+        return bytesRefBuilder.get();

Review Comment:
   Nit: it's a bit sad to load all bytes on top heap, when in many cases we 
could identify that the value is not competitive after reading a few bytes only?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Allow reading binary doc values as a RandomAccessInput [lucene]

Reply via email to