jpountz commented on code in PR #13948: URL: https://github.com/apache/lucene/pull/13948#discussion_r1876152549
########## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java: ########## @@ -791,25 +792,26 @@ public BinaryDocValues getBinary(FieldInfo field) throws IOException { return DocValues.emptyBinary(); } - final RandomAccessInput bytesSlice = data.randomAccessSlice(entry.dataOffset, entry.dataLength); + final RandomAccessInputRef bytesSlice = + new RandomAccessInputRef(data.randomAccessSlice(entry.dataOffset, entry.dataLength)); Review Comment: Nit, maybe extract two variables: `bytesSlice` and `byteSliceRef` to make it a bit easier to read? ########## lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextDocValuesReader.java: ########## @@ -383,17 +392,31 @@ public long cost() { @Override public int advance(int target) throws IOException { - return docsWithField.advance(target); + int doc = docsWithField.advance(target); + if (doc != NO_MORE_DOCS) { + consume(values.apply(doc)); + } + return doc; + } + + private void consume(BytesRef bytesRef) { + bytes.reset(bytesRef.bytes); + inputRef.offset = bytesRef.offset; + inputRef.length = bytesRef.length; } @Override public boolean advanceExact(int target) throws IOException { - return docsWithField.advanceExact(target); + if (docsWithField.advanceExact(target)) { + consume(values.apply(target)); + return true; + } + return false; } @Override - public BytesRef binaryValue() throws IOException { - return values.apply(docsWithField.docID()); + public RandomAccessInputRef randomAccessInputValue() { + return inputRef; Review Comment: What was the problem with calling `values#apply` in `randomAccessInputValue()` rather than in `nextDoc()`/`advance()`/`advanceExact()`? ########## lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java: ########## @@ -627,35 +629,58 @@ public static String toHexString(String s) { } /** - * Interprets the given byte array as UTF-8 and converts to UTF-16. It is the responsibility of - * the caller to make sure that the destination array is large enough. + * Interprets the given {@link RandomAccessInput} slice as UTF-8 and converts to UTF-16. It is the + * responsibility of the caller to make sure that the destination array is large enough. + * + * <p>NOTE: Full characters are read, even if this reads past the length passed (and can result in + * an IOException if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed. + */ + // TODO: broken if chars.offset != 0 + public static int UTF8toUTF16(RandomAccessInput input, long offset, int length, char[] out) Review Comment: Having `UnicodeUtil` need to work on a `RandomAccessInput` doesn't feel right. Do you know if this is performance-sensitive? Otherwise I'd rather make `RandomAccessInputRef` implement `utf8ToString()` by converting to a `BytesRef` first. ########## lucene/core/src/java/org/apache/lucene/store/RandomAccessInputDataInput.java: ########## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import org.apache.lucene.util.RandomAccessInputRef; + +/** + * DataInput backed by a {@link RandomAccessInput}. <b>WARNING:</b> This class omits all low-level + * checks. + * + * @lucene.experimental + */ +public final class RandomAccessInputDataInput extends DataInput { + + private RandomAccessInput input; + private long offset; + private long length; + + private long pos; Review Comment: I wonderif `pos` should start at `offset` rather than `0` so that we don't have to sum up `offset` and `pos` whenever we need to know the current position? ########## lucene/core/src/java/org/apache/lucene/index/BinaryDocValues.java: ########## @@ -18,19 +18,20 @@ package org.apache.lucene.index; import java.io.IOException; -import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.RandomAccessInputRef; -/** A per-document numeric value. */ +/** A per-document binary value. */ Review Comment: Woops, a 10 years old typo. :) ########## lucene/core/src/java/org/apache/lucene/search/FieldComparator.java: ########## @@ -234,7 +235,8 @@ public TermValComparator(int numHits, String field, boolean sortMissingLast) { private BytesRef getValueForDoc(int doc) throws IOException { if (docTerms.advanceExact(doc)) { - return docTerms.binaryValue(); + bytesRefBuilder.copyBytes(docTerms.randomAccessInputValue()); + return bytesRefBuilder.get(); Review Comment: Nit: it's a bit sad to load all bytes on top heap, when in many cases we could identify that the value is not competitive after reading a few bytes only? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org