viirya opened a new issue, #74: URL: https://github.com/apache/arrow-java/issues/74
### Describe the bug, including details regarding any error messages, version, and platform.

This bug was found while debugging https://github.com/apache/datafusion-comet/issues/540. We found that the offsets of some string arrays fall outside the range of their value buffer. For example, a string array's value buffer is only 147456 bytes, but the offsets of the last string are (294894, 294912).

The string array is output by the DataFusion aggregation operator `AggregateExec`. When producing an output batch, the operator may slice the batch if it is larger than a configured size. Slicing a string array in arrow-rs keeps the original value buffer and moves the pointer of the offset buffer, so it is a zero-copy slice.

During the `importOffsets` call in `BufferImportTypeVisitor`, the sliced offset buffer is imported correctly, because the import uses the moved pointer. But when it imports the value buffer, it calculates the capacity as the difference between the last and the first imported offsets. Because the imported offsets come from the slice, the calculated capacity covers only that slice of the value buffer.

For example, the original string array's value buffer is 346536 bytes and its last offset is 346536. We take a slice of 8192 strings from it. The slice's last offset is 294912, but the value buffer is the same (346536 bytes). When `BufferImportTypeVisitor` imports the slice, the imported offsets are [147456, ..., 294912]. It calculates the length of the value buffer as `294912 - 147456 = 147456`, but the actual length is 346536. The offsets are therefore out of range for the incorrect value buffer size of 147456.

To be clear, the root cause is https://github.com/apache/arrow-rs/issues/5896, where arrow-rs exports the moved pointer of the offset buffer for a slice of a string array. Instead, we should use the `offset` field in `ArrowArray` for it. We are going to fix that in arrow-rs.
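The capacity arithmetic described above can be reproduced without Arrow at all. The following is only a minimal sketch: the class and method names are hypothetical and are not part of the arrow-java API, and the three-element offsets array stands in for the 8192-string slice (first offset, start of the last string, last offset).

```java
// Hypothetical illustration only; none of these names exist in arrow-java.
final class SlicedOffsetsDemo {

    // Capacity as this issue describes the importer computing it:
    // last imported offset minus first imported offset.
    static long capacityFromOffsetDifference(int[] offsets) {
        return (long) offsets[offsets.length - 1] - offsets[0];
    }

    // Smallest capacity that keeps every offset addressable when the
    // value buffer pointer has NOT been moved along with the offsets.
    static long capacityFromLastOffset(int[] offsets) {
        return offsets[offsets.length - 1];
    }

    // True if the last offset still lies inside a buffer of the given capacity.
    static boolean offsetsInRange(int[] offsets, long capacity) {
        return offsets[offsets.length - 1] <= capacity;
    }

    public static void main(String[] args) {
        // Offsets taken from the numbers reported in this issue.
        int[] sliceOffsets = {147456, 294894, 294912};

        long buggy = capacityFromOffsetDifference(sliceOffsets);
        long needed = capacityFromLastOffset(sliceOffsets);

        System.out.println("capacity from offset difference: " + buggy);   // 147456
        System.out.println("capacity needed for last offset: " + needed);  // 294912
        System.out.println("in range with difference: " + offsetsInRange(sliceOffsets, buggy));   // false
        System.out.println("in range with last offset: " + offsetsInRange(sliceOffsets, needed)); // true
    }
}
```

The offset difference is only a safe capacity when the first offset is 0, or when the value buffer pointer is advanced by the first offset as well, which is not the case for the zero-copy slice described here.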
However, it seems that neither `ArrayImporter` nor `BufferImportTypeVisitor` considers the `offset` field in `ArrowArray` either.

### Component(s)

Java

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org