iverase opened a new issue, #12459: URL: https://github.com/apache/lucene/issues/12459
### Description

Binary doc values allow storing a variable number of bytes per document. To read those bytes, the API currently returns a `BytesRef` that holds the bytes on heap. To do that, the current implementation preallocates a byte array whose size equals that of the largest doc value. This strategy has two main drawbacks:

1) We use a byte array as an intermediate data structure, so every doc value is copied into that array. In many cases this is unnecessary overhead.
2) When one of the doc values is big, on the order of a few megabytes, it can cause issues with small heaps (or even big heaps, if the value is big enough). This is due to the upfront allocation of a big byte array, which the G1 garbage collector can treat as a humongous allocation and which can cause heap issues under high load.

Therefore I would like to propose adding a new API on top of binary doc values that allows reading them through a `DataInput`. This data input can read directly from the underlying `IndexInput`, so we don't need to copy data into an intermediate data structure and we don't need to preallocate a byte array.

The new API would be built on top of the existing binary doc values and would look something like:

```
import java.io.IOException;
import org.apache.lucene.store.DataInput;

public abstract class DataInputDocValues extends DocValuesIterator {

  /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
  protected DataInputDocValues() {}

  /**
   * Returns the binary value wrapped as a {@link DataInput} for the current document ID. It is
   * illegal to call this method after {@link #advanceExact(int)} returned {@code false}.
   *
   * @return the binary value wrapped as a {@link DataInput}
   */
  public abstract DataInput dataInputValue() throws IOException;
}
```
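
To make the trade-off described in the issue concrete, here is a minimal, self-contained sketch that models the two read strategies with plain `java.nio`. It is a toy model, not Lucene code: `DocValueReadSketch`, `readCopy` and `readView` are hypothetical names, a `ByteBuffer` stands in for the underlying `IndexInput`, and the shared `scratch` array stands in for the preallocated buffer behind a `BytesRef`.

```java
import java.nio.ByteBuffer;

/** Toy model only: none of these names are Lucene classes. */
final class DocValueReadSketch {
  private final ByteBuffer data; // all values stored back to back (stand-in for the index file)
  private final int[] offsets;   // value of doc i occupies [offsets[i], offsets[i + 1])
  private final byte[] scratch;  // sized to the largest value and allocated up front

  DocValueReadSketch(byte[][] values) {
    offsets = new int[values.length + 1];
    int total = 0;
    int max = 0;
    for (int i = 0; i < values.length; i++) {
      offsets[i] = total;
      total += values[i].length;
      max = Math.max(max, values[i].length);
    }
    offsets[values.length] = total;
    data = ByteBuffer.allocate(total);
    for (byte[] value : values) {
      data.put(value);
    }
    data.flip();
    // One large value forces a large array here, even if most values are tiny.
    scratch = new byte[max];
  }

  /** Copying read: copies the value into the shared scratch array and returns its length. */
  int readCopy(int doc) {
    int length = offsets[doc + 1] - offsets[doc];
    ByteBuffer view = data.duplicate();
    view.position(offsets[doc]);
    view.get(scratch, 0, length);
    return length;
  }

  /** The shared scratch array that readCopy fills. */
  byte[] scratch() {
    return scratch;
  }

  /** Streaming read: returns a read-only view over the value, with no intermediate array. */
  ByteBuffer readView(int doc) {
    ByteBuffer view = data.asReadOnlyBuffer();
    view.position(offsets[doc]);
    view.limit(offsets[doc + 1]);
    return view.slice();
  }
}
```

In this model `readCopy` pays for one copy per document plus an upfront array as large as the biggest value, while `readView` only creates a small view object whose cost does not depend on the value size. The proposed `dataInputValue()` plays the role of `readView` here; how it would actually be implemented on top of `IndexInput` is not specified in the issue.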