DrChainsaw opened a new issue, #194:
URL: https://github.com/apache/arrow-java/issues/194

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   This might be more of a usage question, since I couldn't find anything in the format docs on how to set the length field when using compression.
   
   The issue is that if I try to read an empty table with the [Julia extension](https://github.com/apache/arrow-julia), it just hangs. The reason seems to be that it [only checks](https://github.com/apache/arrow-julia/blob/e893c327f177f5a4d5efeab831df0fe93ab4ec5b/src/table.jl#L518-L529) the length field in the RecordBatch when deciding whether to attempt to decode, and not the length read from the first 8 bytes of the data.
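   
   For reference, my understanding of that 8-byte prefix from the format spec (the `BodyCompression` comments in `Message.fbs`) is that each body buffer in a compressed RecordBatch starts with its uncompressed length as a little-endian int64, with -1 meaning the bytes that follow are stored uncompressed. A minimal sketch of that reading (class and method names are just illustrative):
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;

   public class BufferPrefix {
       // Read the 8-byte prefix of a body buffer: the uncompressed length,
       // or -1 if the bytes that follow were left uncompressed.
       static long uncompressedLength(ByteBuffer buffer) {
           return buffer.order(ByteOrder.LITTLE_ENDIAN).getLong(0);
       }

       public static void main(String[] args) {
           // Fake a prefix that declares 0 uncompressed bytes, as I would
           // expect for an empty (but compressed) column buffer.
           ByteBuffer prefixOnly = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
           prefixOnly.putLong(0, 0L);
           System.out.println(uncompressedLength(prefixOnly)); // prints 0
       }
   }
   ```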
   
   The file created by the code below is readable by both pyarrow and the Java implementation, so chances are that the Julia implementation is doing it wrong (I will open an issue there as well). Is there some reference for how one should interpret the length field in a RecordBatch when using compression?
   
   <details><summary>Code to create an empty table, in case I'm doing something wrong</summary>
   
   ```java
   import static java.util.Arrays.asList;

   import java.io.File;
   import java.io.FileOutputStream;
   import java.io.IOException;

   import org.apache.arrow.compression.CommonsCompressionFactory;
   import org.apache.arrow.memory.BufferAllocator;
   import org.apache.arrow.memory.RootAllocator;
   import org.apache.arrow.vector.VectorSchemaRoot;
   import org.apache.arrow.vector.compression.CompressionUtil;
   import org.apache.arrow.vector.ipc.ArrowFileWriter;
   import org.apache.arrow.vector.ipc.message.IpcOption;
   import org.apache.arrow.vector.types.pojo.ArrowType;
   import org.apache.arrow.vector.types.pojo.Field;
   import org.apache.arrow.vector.types.pojo.FieldType;
   import org.apache.arrow.vector.types.pojo.Schema;

   public class WriteEmptyCompressed {
       public static void main(String[] args) {
           try (BufferAllocator allocator = new RootAllocator()) {
               Field name = new Field("name", FieldType.nullable(new ArrowType.Utf8()), null);
               Field age = new Field("age", FieldType.nullable(new ArrowType.Int(32, true)), null);
               Schema schemaPerson = new Schema(asList(name, age));
               try (VectorSchemaRoot vectorSchemaRoot = VectorSchemaRoot.create(schemaPerson, allocator)) {
                   vectorSchemaRoot.allocateNew(); // Needed?
                   vectorSchemaRoot.setRowCount(0); // Needed?
                   File file = new File("randon_access_to_file.arrow");
                   try (FileOutputStream fileOutputStream = new FileOutputStream(file);
                        ArrowFileWriter writer = new ArrowFileWriter(vectorSchemaRoot, null,
                                fileOutputStream.getChannel(), null, IpcOption.DEFAULT,
                                CommonsCompressionFactory.INSTANCE, CompressionUtil.CodecType.ZSTD)) {
                       writer.start();
                       writer.writeBatch();
                       writer.end();
                       System.out.println("Record batches written: " + writer.getRecordBlocks().size()
                               + ". Number of rows written: " + vectorSchemaRoot.getRowCount());
                   } catch (IOException e) {
                       e.printStackTrace();
                   }
               }
           }
       }
   }
   ```
   </details>
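   
   For what it's worth, this is roughly how I read the file back with the Java implementation (a sketch from memory, so treat the reader setup as approximate; the compression factory is needed for decompression):
   
   ```java
   import java.io.FileInputStream;
   import java.io.IOException;

   import org.apache.arrow.compression.CommonsCompressionFactory;
   import org.apache.arrow.memory.BufferAllocator;
   import org.apache.arrow.memory.RootAllocator;
   import org.apache.arrow.vector.ipc.ArrowFileReader;

   public class ReadBack {
       public static void main(String[] args) {
           try (BufferAllocator allocator = new RootAllocator();
                FileInputStream in = new FileInputStream("randon_access_to_file.arrow");
                ArrowFileReader reader = new ArrowFileReader(
                        in.getChannel(), allocator, CommonsCompressionFactory.INSTANCE)) {
               // Each loadNextBatch() fills the reader's VectorSchemaRoot with the
               // next record batch; for this file I expect one batch with 0 rows.
               while (reader.loadNextBatch()) {
                   System.out.println("Rows: " + reader.getVectorSchemaRoot().getRowCount());
               }
           } catch (IOException e) {
               e.printStackTrace();
           }
       }
   }
   ```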
   
   When I tried saving a compressed empty table using pyarrow, I got 0 as the length field, and the Julia implementation could read the table without hanging.
   
   Disclaimer: I don't have a working Python installation, so I did this through PythonCall. Hopefully I managed to remove all the Julia-isms so that it runs in Python:
   ```python
   import pyarrow as pa

   schema = pa.schema([pa.field('nums', pa.int32())])

   with pa.OSFile('bigfile.arrow', 'wb') as sink:
       with pa.ipc.new_file(sink, schema,
                            options=pa.ipc.IpcWriteOptions(compression='zstd')) as writer:
           batch = pa.record_batch([pa.array([], type=pa.int32())], schema)
           writer.write(batch)
   ```
   
   ### Component(s)
   
   Java

