iverase opened a new issue, #12459:
URL: https://github.com/apache/lucene/issues/12459

   ### Description
   
   Binary doc values store a variable number of bytes per document. 
To read those bytes, the API currently returns a BytesRef that holds the 
bytes on heap. To do so, the current implementation preallocates a byte array 
whose size equals that of the biggest doc value. This strategy has two main 
drawbacks:
   
   1) The byte array acts as an intermediate data structure, so each doc 
value is copied into it. In many cases this is unnecessary overhead.
   
   2) When one of the doc values is big, on the order of a few megabytes, it 
can cause issues with small heaps (or even big heaps if the value is big 
enough). This is due to the upfront allocation of a big byte array, which the 
G1 garbage collector can treat as a humongous allocation and which can cause 
heap issues under high load.
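
To make drawback 2 concrete, here is a rough sketch of when a preallocated byte array would count as humongous. It assumes the commonly documented G1 rule that an object of at least half a region is humongous, and that regions are powers of two between 1 MB and 32 MB. The class name, method name and the sizes used are illustrative only and are not taken from Lucene.

```java
public class HumongousCheck {

  /**
   * Approximation of the G1 rule: an object is treated as humongous when it
   * is at least half a region in size. Array header overhead is ignored.
   */
  public static boolean isLikelyHumongous(long objectBytes, long regionBytes) {
    return objectBytes >= regionBytes / 2;
  }

  public static void main(String[] args) {
    long regionBytes = 4L * 1024 * 1024;  // assumed 4 MB G1 region
    long scratchBytes = 3L * 1024 * 1024; // assumed 3 MB biggest doc value
    // 3 MB >= 2 MB, so this prints "true"
    System.out.println(isLikelyHumongous(scratchBytes, regionBytes));
  }
}
```

With small heaps G1 tends to pick small regions, so even a buffer of a megabyte or two can cross the threshold.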
   
   
   Therefore I would like to propose adding a new API on top of binary doc 
values that allows reading them through a DataInput. This data input can read 
directly from the underlying IndexInput, so we need neither to copy data into 
an intermediate data structure nor to preallocate a byte array.
   
   The new API would be built on top of the existing binary doc values and it 
would look something like:
   
   ```
   public abstract class DataInputDocValues extends DocValuesIterator {
   
     /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
     protected DataInputDocValues() {}
   
     /**
      * Returns the binary value wrapped as a {@link DataInput} for the current document ID. It is
      * illegal to call this method after {@link #advanceExact(int)} returned {@code false}.
      *
      * @return the binary value wrapped as a {@link DataInput}
      */
     public abstract DataInput dataInputValue() throws IOException;
   }
   ```
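
The difference between the two read paths can be sketched without Lucene on the classpath. The following is a minimal, self-contained illustration and not the proposed implementation: a `java.nio.ByteBuffer` stands in for the underlying IndexInput, and the class `StreamingDocValueSketch` with its methods `sumByCopy` and `sumByStream` is hypothetical. The copying path mirrors the current behaviour (a scratch array sized to the biggest value), while the streaming path reads through a bounded view of the stored bytes.

```java
import java.nio.ByteBuffer;

public class StreamingDocValueSketch {
  private final ByteBuffer data; // stand-in for the underlying IndexInput
  private final int[] offsets;   // value of doc i spans offsets[i]..offsets[i + 1]
  private final byte[] scratch;  // copying path only: sized to the biggest value

  public StreamingDocValueSketch(byte[][] values) {
    offsets = new int[values.length + 1];
    int maxLength = 0;
    for (int i = 0; i < values.length; i++) {
      offsets[i + 1] = offsets[i] + values[i].length;
      maxLength = Math.max(maxLength, values[i].length);
    }
    data = ByteBuffer.allocate(offsets[values.length]);
    for (byte[] value : values) {
      data.put(value);
    }
    scratch = new byte[maxLength]; // upfront allocation, as in drawback 2
  }

  /** Copying path: the value is first moved into the scratch array. */
  public long sumByCopy(int doc) {
    int start = offsets[doc];
    int length = offsets[doc + 1] - start;
    for (int i = 0; i < length; i++) {
      scratch[i] = data.get(start + i); // absolute read, then copy
    }
    long sum = 0;
    for (int i = 0; i < length; i++) {
      sum += scratch[i];
    }
    return sum;
  }

  /** Streaming path: bytes are consumed from a bounded view, with no copy. */
  public long sumByStream(int doc) {
    ByteBuffer view = data.asReadOnlyBuffer(); // shares content, own position
    view.limit(offsets[doc + 1]);
    view.position(offsets[doc]);
    long sum = 0;
    while (view.hasRemaining()) {
      sum += view.get();
    }
    return sum;
  }

  public int scratchLength() {
    return scratch.length;
  }

  public static void main(String[] args) {
    StreamingDocValueSketch sketch =
        new StreamingDocValueSketch(new byte[][] {{1, 2, 3}, {}, {10, 20}});
    System.out.println(sketch.sumByCopy(0) + " " + sketch.sumByStream(0));
    System.out.println(sketch.sumByCopy(2) + " " + sketch.sumByStream(2));
  }
}
```

Both paths return the same result. Only the copying path needs a buffer whose size depends on the biggest stored value.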


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

