asfimport opened a new issue, #230:
URL: https://github.com/apache/arrow-java/issues/230
I encountered this bug when I loaded a dataframe stored in the Arrow IPC
format.
```java
// Java Code from "Apache Arrow Java Cookbook"
File file = new File("example.arrow");
try (
    BufferAllocator rootAllocator = new RootAllocator();
    FileInputStream fileInputStream = new FileInputStream(file);
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
) {
    System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
    for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
        reader.loadRecordBatch(arrowBlock);
        VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
        System.out.print(vectorSchemaRootRecover.contentToTSVString());
    }
} catch (IOException e) {
    e.printStackTrace();
}
```
Call stack:
```
Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
    at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
    at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
    at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
    at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
    at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
    at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197)
```
This bug can be reproduced by a simple dataframe created by pandas:
```python
import pandas as pd
pd.DataFrame({'a': range(10000)}).to_feather('example.arrow')
```
Pandas compresses the dataframe by default. If compression is turned
off, Java can load the file. My guess is therefore that the bounds-checking
code is buggy when loading a compressed file.
That dataframe can be loaded in polars, pandas and pyarrow, so it's unlikely
to be a pandas bug.
**Environment**: Linux and Windows.
Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)
**Reporter**: [Georeth Zhou](https://issues.apache.org/jira/browse/ARROW-18198)
<sub>**Note**: *This issue was originally created as
[ARROW-18198](https://issues.apache.org/jira/browse/ARROW-18198). Please see
the [migration documentation](https://github.com/apache/arrow/issues/14542) for
further details.*</sub>