Tim-Brooks commented on PR #14213: URL: https://github.com/apache/lucene/pull/14213#issuecomment-2640892230
I am opening this proposed change to support writing a stored field from a byte source that does not require a contiguous array allocation. There are times when we would like to store large stored fields, and the requirement to provide a fully contiguous byte array can cause issues on smaller heaps, particularly when the original data is already on heap in a non-contiguous source.

I took a stab at this using a `DataInput` as the source for indexing. I went with this initial approach because `StoredFieldsWriter` already supports `DataInput` (seemingly for merges). I wrapped the `DataInput` in a record-style class, `StoredFieldDataInput`, to associate a `length` with it.

If this approach has support I will continue to refine the PR. In particular, I was uncertain whether Lucene would want `DataInput` to be fully supported in `Field`, similar to `stringValue`, `readerValue`, `doubleValue`, etc. (with getters and setters), or whether to stick with what I did, where it is only really supported in `StoredField` as a `storedValue`. I would also be interested in which additional test classes I should modify to ensure coverage for this type. Finally, would we want to modify `StoredFieldsWriter#writeField(FieldInfo info, DataInput value, int length)` to use this `StoredFieldDataInput` abstraction instead of also passing the length int (if others support the introduction of this abstraction)?

`DataInput` is only one potential approach. I took it because there was already some work around `DataInput` with stored fields. A `BytesRef[]` or `ByteBuffer[]` would also work for our use. `DataInput` has the downside of requiring a local intermediate buffer in `ByteBuffersDataOutput` to copy into direct bytes. `BytesRef[]` would work but would not allow direct memory as a source (this doesn't matter for our use case but is worth noting).
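As a rough illustration of the shape described above (not the code in the PR), the sketch below pairs a sequential byte source with its length and copies it to an output through a small fixed buffer, so the source chunks are never merged into one contiguous array. It uses only `java.io` stand-ins: `SizedSource`, `fromChunks`, and `writeField` are hypothetical names, with `InputStream` and `OutputStream` standing in for Lucene's `DataInput` and the stored fields output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ChunkedStoredValueSketch {

  /** Hypothetical stand-in for the wrapper idea: a sequential source plus its total length. */
  static final class SizedSource {
    final InputStream in;
    final int length;

    SizedSource(InputStream in, int length) {
      this.in = in;
      this.length = length;
    }
  }

  /** Presents non-contiguous chunks as one stream without copying them into a single array. */
  static SizedSource fromChunks(List<byte[]> chunks) {
    List<InputStream> streams = new ArrayList<>();
    int total = 0;
    for (byte[] chunk : chunks) {
      streams.add(new ByteArrayInputStream(chunk)); // wraps the array, no copy
      total = Math.addExact(total, chunk.length);   // fail loudly on int overflow
    }
    return new SizedSource(new SequenceInputStream(Collections.enumeration(streams)), total);
  }

  /** Copies exactly value.length bytes to out through a small reusable buffer. */
  static void writeField(SizedSource value, OutputStream out) throws IOException {
    byte[] buffer = new byte[8192]; // the only extra allocation, independent of field size
    int remaining = value.length;
    while (remaining > 0) {
      int read = value.in.read(buffer, 0, Math.min(buffer.length, remaining));
      if (read < 0) {
        throw new EOFException("source ended with " + remaining + " bytes missing");
      }
      out.write(buffer, 0, read);
      remaining -= read;
    }
  }

  public static void main(String[] args) throws IOException {
    SizedSource value = fromChunks(List.of(new byte[] {1, 2, 3}, new byte[] {4, 5}));
    // The in-memory sink is only for the demo; a real writer would stream to its own output.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeField(value, out);
    System.out.println(out.size()); // number of bytes written
  }
}
```

The fixed copy buffer in `writeField` plays the same role as the intermediate buffer mentioned above as a downside of the `DataInput` approach: the source stays generic, at the cost of one extra copy per chunk read.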
`ByteBuffer[]` supports everything (direct, no intermediate buffer) but is theoretically a bit less flexible than `DataInput` which is a very flexible abstraction. Any of these approaches are fine for my use case and I would be happy to work on whichever has the most support and consensus.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org