asfimport opened a new issue, #230:
URL: https://github.com/apache/arrow-java/issues/230

   I encountered this bug when loading a dataframe stored in the Arrow IPC format.

   ```java
   // Java code from the "Apache Arrow Java Cookbook"
   import java.io.File;
   import java.io.FileInputStream;
   import java.io.IOException;

   import org.apache.arrow.memory.BufferAllocator;
   import org.apache.arrow.memory.RootAllocator;
   import org.apache.arrow.vector.VectorSchemaRoot;
   import org.apache.arrow.vector.ipc.ArrowFileReader;
   import org.apache.arrow.vector.ipc.message.ArrowBlock;

   File file = new File("example.arrow");
   try (
           BufferAllocator rootAllocator = new RootAllocator();
           FileInputStream fileInputStream = new FileInputStream(file);
           ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
   ) {
       System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
       for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
           reader.loadRecordBatch(arrowBlock);
           VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
           System.out.print(vectorSchemaRootRecover.contentToTSVString());
       }
   } catch (IOException e) {
       e.printStackTrace();
   }
   ```
   Call stack:
   ```
   Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
       at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
       at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
       at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
       at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
       at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
       at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
       at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
       at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
       at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197)
   ```
   This bug can be reproduced with a simple dataframe created by pandas:

   ```python
   import pandas as pd

   pd.DataFrame({'a': range(10000)}).to_feather('example.arrow')
   ```
   Pandas compresses the dataframe by default. If compression is turned off, Java can load the dataframe, so I suspect the bounds-checking code is buggy when loading compressed files.
   That dataframe can be loaded by polars, pandas, and pyarrow, so it is unlikely to be a pandas bug.
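
   A quick cross-check sketch, assuming pyarrow and polars are installed (`pl.read_ipc` is polars' reader for Feather/Arrow IPC files):

   ```python
   import pandas as pd
   import polars as pl
   import pyarrow.feather as feather

   # All three readers handle the compressed file without error.
   print(pd.read_feather('example.arrow').shape)        # (10000, 1)
   print(pl.read_ipc('example.arrow').shape)            # (10000, 1)
   print(feather.read_table('example.arrow').num_rows)  # 10000
   ```
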
   **Environment**: Linux and Windows
   Apache Arrow Java versions: 10.0.0, 9.0.0, 4.0.1
   pandas 1.4.2 with pyarrow 8.0.0 (anaconda3-2022.05)
   **Reporter**: [Georeth Zhou](https://issues.apache.org/jira/browse/ARROW-18198)
   
   <sub>**Note**: *This issue was originally created as [ARROW-18198](https://issues.apache.org/jira/browse/ARROW-18198). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

