viirya opened a new issue, #74:
URL: https://github.com/apache/arrow-java/issues/74

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   This bug is found during debugging the issue 
https://github.com/apache/datafusion-comet/issues/540.
   
   We found that some string arrays' offsets are out of the range of their 
value buffer. For example, a string array's value buffer is only 147456 bytes, 
but the offsets of its last string are (294894, 294912).
   
   The string array is output by the DataFusion aggregation operator 
`AggregateExec`. When producing the output batch, the operator may slice the 
batch if it is larger than a configured size. A slice of a string array in 
arrow-rs keeps the original value buffer and moves the pointer of the offset 
buffer, so it is a zero-copy slice.
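
   The zero-copy slice described above can be sketched with a toy model in 
plain Java. This is not arrow-rs or arrow-java code; the class 
`ZeroCopySliceSketch` and its fields are hypothetical names used only for 
illustration. The point is that the slice shares the value buffer and only 
moves the window into the offsets, so its first offset is no longer zero.

```java
import java.nio.charset.StandardCharsets;

public class ZeroCopySliceSketch {
    // Toy model of a string array: one shared value buffer plus a window of offsets.
    public final byte[] values;  // value buffer, never copied by slice()
    public final int[] offsets;  // full offsets buffer
    public final int start;      // index of the first visible offset (the "moved pointer")
    public final int length;     // number of visible strings

    public ZeroCopySliceSketch(byte[] values, int[] offsets, int start, int length) {
        this.values = values;
        this.offsets = offsets;
        this.start = start;
        this.length = length;
    }

    // Slicing only moves the window; both buffers are shared with the original.
    public ZeroCopySliceSketch slice(int from, int len) {
        return new ZeroCopySliceSketch(values, offsets, start + from, len);
    }

    public int firstOffset() {
        return offsets[start];
    }

    public int lastOffset() {
        return offsets[start + length];
    }

    // Offsets are absolute positions in the shared value buffer.
    public String get(int i) {
        int begin = offsets[start + i];
        int end = offsets[start + i + 1];
        return new String(values, begin, end - begin, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] values = "aabbccdd".getBytes(StandardCharsets.UTF_8);
        ZeroCopySliceSketch full =
            new ZeroCopySliceSketch(values, new int[] {0, 2, 4, 6, 8}, 0, 4);
        ZeroCopySliceSketch tail = full.slice(2, 2);
        System.out.println(tail.get(0) + " " + tail.get(1));
        System.out.println(tail.firstOffset() + ".." + tail.lastOffset());
        System.out.println("shares value buffer: " + (tail.values == full.values));
    }
}
```

   In this model the slice's offsets still point into the full value buffer, 
so any consumer has to treat that buffer as at least `lastOffset()` bytes long.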
   
   During the `importOffsets` call in `BufferImportTypeVisitor`, the sliced 
offset buffer is imported correctly, because the import uses the moved pointer 
and calculates the offset buffer size correctly.
   
   But when it imports the value buffer, it calculates the buffer's capacity 
as the difference between the last and first imported offsets. Because the 
imported offsets come from the slice, the calculated capacity covers only the 
slice's portion of the value buffer, not the whole buffer.
   
   For example, the original string array's value buffer is 346536 bytes and 
its last offset is 346536. We take a slice of 8192 strings from it. The sliced 
array's last offset is 294912, but the value buffer is the same (346536 bytes).
   
   When `BufferImportTypeVisitor` imports the slice, the imported offsets are 
[147456, ..., 294912]. It calculates the length of the value buffer as `294912 - 
147456 = 147456`, but the actual length of the value buffer is 346536.
   
   The offsets are therefore out of range for the incorrectly computed value 
buffer size of 147456.
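
   The arithmetic above can be reproduced with a small standalone sketch. It 
does not use arrow-java; `SlicedValueCapacity` and its methods are hypothetical 
names, and the uniform 18-byte string length is an assumption chosen because it 
matches the numbers in this report (147456 / 8192 = 18, and the last string 
spans 294894 to 294912).

```java
public class SlicedValueCapacity {
    // Build the offsets a slice would expose: count strings of equal size
    // (an assumption), starting at firstOffset in the shared value buffer.
    public static int[] sliceOffsets(int firstOffset, int count, int bytesPerString) {
        int[] offsets = new int[count + 1];
        for (int i = 0; i <= count; i++) {
            offsets[i] = firstOffset + i * bytesPerString;
        }
        return offsets;
    }

    // Capacity as described in this issue: last imported offset minus first.
    public static long capacityFromDifference(int[] offsets) {
        return (long) offsets[offsets.length - 1] - offsets[0];
    }

    // True if every offset can be used against a buffer of the given capacity.
    // Offsets are non-decreasing, so checking the last one is enough.
    public static boolean offsetsInRange(int[] offsets, long capacity) {
        return offsets[offsets.length - 1] <= capacity;
    }

    public static void main(String[] args) {
        int[] offsets = sliceOffsets(147456, 8192, 18);
        long computed = capacityFromDifference(offsets);
        long actual = 346536L; // real size of the shared value buffer

        System.out.println("last string: (" + offsets[8191] + ", " + offsets[8192] + ")");
        System.out.println("computed capacity: " + computed);
        System.out.println("in range of computed capacity: " + offsetsInRange(offsets, computed));
        System.out.println("in range of actual buffer: " + offsetsInRange(offsets, actual));
    }
}
```

   With these inputs the computed capacity is 147456 while the last offset is 
294912, so the offsets only fit when checked against the real buffer size.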
   
   To be clear, the root cause is 
https://github.com/apache/arrow-rs/issues/5896: arrow-rs exports the moved 
pointer of the offset buffer for a slice of a string array. Instead, it should 
use the `offset` field in `ArrowArray`. We are going to fix that in arrow-rs.
   
   However, it seems that neither `ArrayImporter` nor `BufferImportTypeVisitor` 
considers the `offset` field in `ArrowArray` either.
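
   As a rough sketch of what taking `offset` into account could look like for 
a string array with 32-bit offsets: in the Arrow C Data Interface, `offset` 
counts elements from the physical start of the buffers, so the slice's strings 
are described by offsets[offset] through offsets[offset + length]. The code 
below is plain Java, not the arrow-java importer; `OffsetFieldSketch` and its 
methods are hypothetical, and the uniform 18-byte strings are again an 
assumption that matches the numbers in this report.

```java
public class OffsetFieldSketch {
    static final int OFFSET_WIDTH = 4; // bytes per int32 offset

    // Bytes of the offsets buffer that must be readable for a slice that
    // starts at arrayOffset and holds length strings.
    public static long offsetsBufferBytes(long arrayOffset, long length) {
        return (arrayOffset + length + 1) * OFFSET_WIDTH;
    }

    // Bytes of the value buffer that must be readable: the end of the last
    // string in the slice, measured from the start of the value buffer.
    public static long valueBufferBytes(int[] offsets, long arrayOffset, long length) {
        return offsets[(int) (arrayOffset + length)];
    }

    // Full, unsliced offsets for count strings of equal size (an assumption).
    public static int[] fullOffsets(int count, int bytesPerString) {
        int[] offsets = new int[count + 1];
        for (int i = 0; i <= count; i++) {
            offsets[i] = i * bytesPerString;
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] offsets = fullOffsets(19252, 18); // 19252 * 18 = 346536 bytes of values
        long arrayOffset = 8192;                // slice starts at string 8192
        long length = 8192;                     // slice holds 8192 strings

        System.out.println("first offset: " + offsets[(int) arrayOffset]);
        System.out.println("offsets buffer bytes: " + offsetsBufferBytes(arrayOffset, length));
        System.out.println("value buffer bytes: " + valueBufferBytes(offsets, arrayOffset, length));
    }
}
```

   Sizing the value buffer from the last offset alone, rather than from 
`last - first`, gives 294912 bytes here, which stays within the real 346536-byte 
buffer.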
   
   ### Component(s)
   
   Java


