Tim-Brooks commented on PR #14213:
URL: https://github.com/apache/lucene/pull/14213#issuecomment-2640892230

   I am opening this proposed change to support writing a stored field from a 
byte source that does not require a contiguous array allocation. We sometimes 
need to store large stored fields, and the requirement to provide a fully 
contiguous byte array can cause issues on smaller heaps, particularly when the 
original data is already on heap in a non-contiguous source.
   
   I took a stab at this using a `DataInput` as a source for the indexing. I 
went with this initial approach as it aligns with the fact that 
`StoredFieldsWriter` already supports `DataInput` (seemingly for merges).
   
   I wrapped the `DataInput` in a record-style class, `StoredFieldDataInput`, 
to associate a `length` with it.
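
   To illustrate why the length has to travel with the source, here is a 
minimal, self-contained sketch. It is not the code in the PR: `SizedSource` is 
a hypothetical name, and `java.io.InputStream` stands in for Lucene's 
`DataInput` so the example runs without Lucene on the classpath. A streaming 
source does not know how many bytes it holds, so the consumer needs the length 
up front to copy exactly that many bytes through a small reusable buffer.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Objects;

/** Hypothetical illustration only; InputStream stands in for Lucene's DataInput. */
final class SizedSource {
  private final InputStream in;
  private final int length;

  SizedSource(InputStream in, int length) {
    if (length < 0) {
      throw new IllegalArgumentException("length must be >= 0, got " + length);
    }
    this.in = Objects.requireNonNull(in);
    this.length = length;
  }

  InputStream in() {
    return in;
  }

  int length() {
    return length;
  }

  /** Copies exactly length bytes to out via a small buffer, never one large array. */
  void copyTo(OutputStream out) throws IOException {
    byte[] buffer = new byte[Math.min(length, 8192)];
    int remaining = length;
    while (remaining > 0) {
      int read = in.read(buffer, 0, Math.min(buffer.length, remaining));
      if (read < 0) {
        throw new EOFException("source ended with " + remaining + " bytes still expected");
      }
      out.write(buffer, 0, read);
      remaining -= read;
    }
  }
}
```

   The copy loop never allocates more than 8 KB regardless of the field size, 
which is the property the proposal is after.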
   
   If this approach has support I will continue to refine the PR. In 
particular, I was uncertain whether Lucene would want `DataInput` to be fully 
supported in `Field`, similar to `stringValue`, `readerValue`, `doubleValue`, 
etc. (with getters and setters), or whether to stick with what I did, where it 
is only really supported in `StoredField` as a `storedValue`. I would also be 
interested in which additional test classes I should modify to ensure coverage 
of this type.
   
   Finally, would we want to modify `StoredFieldsWriter#writeField(FieldInfo 
info, DataInput value, int length)` to use this `StoredFieldDataInput` 
abstraction instead of also having the length int (if others support the 
introduction of this abstraction)?
   
   `DataInput` is only one potential approach. I chose it because there was 
already some work around `DataInput` with stored fields.
   
   A `BytesRef[]` or `ByteBuffer[]` would also work for our use case. 
`DataInput` has the downside of requiring a local intermediate buffer in 
`ByteBuffersDataOutput` to copy into direct bytes. `BytesRef[]` would work but 
would not allow direct memory as a source (this doesn't matter for our use case 
but is worth noting). `ByteBuffer[]` supports everything (direct memory, no 
intermediate buffer) but is theoretically a bit less flexible than `DataInput`, 
which is a very flexible abstraction.
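
   As a rough sketch of the `ByteBuffer[]` option, the snippet below copies a 
mix of heap and direct chunks into a destination buffer without first merging 
them into one `byte[]` and without an intermediate copy buffer. `ChunkedBytes` 
is a hypothetical helper and `dest` merely stands in for whatever buffer the 
writer owns; none of this is Lucene API.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not part of Lucene. */
final class ChunkedBytes {
  private ChunkedBytes() {}

  /** Total number of readable bytes across all chunks. */
  static int totalLength(ByteBuffer[] chunks) {
    int total = 0;
    for (ByteBuffer chunk : chunks) {
      // addExact fails loudly if the combined size overflows an int
      total = Math.addExact(total, chunk.remaining());
    }
    return total;
  }

  /** Copies each chunk into dest in order; chunks may be heap or direct. */
  static void copyTo(ByteBuffer[] chunks, ByteBuffer dest) {
    for (ByteBuffer chunk : chunks) {
      // duplicate() shares the bytes but leaves the caller's position and limit untouched
      dest.put(chunk.duplicate());
    }
  }
}
```

   Because each chunk reports its own `remaining()`, the total length can be 
derived from the chunks themselves rather than passed separately.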
   
   Any of these approaches are fine for my use case and I would be happy to 
work on whichever has the most support and consensus.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

