jpountz commented on PR #12460: URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715126194
The more I think of this change, the more I like it: most of the time, you would need to read data out of binary doc values, e.g. (variable-length) integers, strings, etc. and exposing binary doc values as a data input not only makes this easier to do, but also more efficient by saving one memory copy. Thoughts: - We are losing the symmetry between the index-time API (BinaryDocValuesField), which takes a byte[] while the search-time API produces a `DataInput`. Not a deal breaker, but users have to be careful about endianness. - Today it's possible to consume values multiple times, but the data input is stateful, so if one consumer reads data from it, and then another consumer tries to read data again it will fail unless it resets the input via a call to `advanceExact`. Again not a deal breaker but something that could be a surprise to some callers. In terms of backward compatibility, I'm contemplating not introducing a new `DataInputDocValues` class, and instead have a `dataInput()` method on `BinaryDocValues`, as well as a default implementation of `binaryValue()` that reads all bytes from the data input. What do you think? I'm also curious of the opinion of our sophisticated backward compatibility policeman @uschindler. To @stefanvodita 's point, it would be nice to migrate at least one or two call sites that would get simpler with a `DataInput` to further validate this change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org