hu6360567 opened a new issue, #185: URL: https://github.com/apache/arrow-java/issues/185
### Describe the bug, including details regarding any error messages, version, and platform.

I'm trying to import/export data to/from a database in Python through an `ArrayStream`, using pyarrow.jvm and JDBC. In order to export an `ArrowVectorIterator` as a stream without unloading it to `RecordBatch` on the Java side before the export, I wrap the `ArrowVectorIterator` in an `ArrowReader` as below:

```java
public class ArrowVectorIteratorReader extends ArrowReader {
    private final Iterator<VectorSchemaRoot> iterator;
    private final Schema schema;
    private VectorSchemaRoot root;

    public ArrowVectorIteratorReader(BufferAllocator allocator, Iterator<VectorSchemaRoot> iterator, Schema schema) {
        super(allocator);
        this.iterator = iterator;
        this.schema = schema;
        this.root = null;
    }

    @Override
    public VectorSchemaRoot getVectorSchemaRoot() throws IOException {
        if (root == null) {
            return super.getVectorSchemaRoot();
        }
        return root;
    }

    @Override
    public boolean loadNextBatch() throws IOException {
        if (iterator.hasNext()) {
            VectorSchemaRoot lastRoot = root;
            root = iterator.next();
            if (root != lastRoot && lastRoot != null) {
                lastRoot.close();
            }
            return true;
        } else {
            return false;
        }
    }

    @Override
    public long bytesRead() {
        return 0;
    }

    @Override
    protected void closeReadSource() throws IOException {
        if (iterator instanceof AutoCloseable) {
            try {
                ((AutoCloseable) iterator).close();
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
        root.close();
    }

    @Override
    protected Schema readSchema() throws IOException {
        return schema;
    }
}
```

When the `ArrowVectorIterator` is created with a config that has `reuseVectorSchemaRoot` enabled, utf8 arrays may come out corrupted on the Python side, although everything works as expected on the Java side. The Java code is below:

```java
try (final ArrowReader source = porter.importData(1); // returns ArrowVectorIteratorReader with batchSize=1
     final ArrowArrayStream stream = ArrowArrayStream.allocateNew(allocator)) {
    Data.exportArrayStream(allocator, source, stream);
    try (final ArrowReader reader = Data.importArrayStream(allocator, stream)) {
        while (reader.loadNextBatch()) {
            // root from getVectorSchemaRoot() is valid for every vector
            totalRecord += reader.getVectorSchemaRoot().getRowCount();
        }
    }
}
```

On the Python side, the behavior is hard to explain. The stream exported from Java is wrapped into a `RecordBatchReader` and written to different file formats:

```python
def wrap_from_java_stream_to_generator(java_arrow_stream, allocator=None, yield_schema=False):
    if allocator is None:
        allocator = get_java_root_allocator().allocator
    c_stream = arrow_c.new("struct ArrowArrayStream*")
    c_stream_ptr = int(arrow_c.cast("uintptr_t", c_stream))

    org = jpype.JPackage("org")
    java_wrapped_stream = org.apache.arrow.c.ArrowArrayStream.wrap(c_stream_ptr)
    org.apache.arrow.c.Data.exportArrayStream(allocator, java_arrow_stream, java_wrapped_stream)

    # noinspection PyProtectedMember
    with pa.RecordBatchReader._import_from_c(c_stream_ptr) as reader:  # type: pa.RecordBatchReader
        if yield_schema:
            yield reader.schema
        yield from reader


def wrap_from_java_stream(java_arrow_stream, allocator=None):
    generator = wrap_from_java_stream_to_generator(java_arrow_stream, allocator, yield_schema=True)
    schema = next(generator)
    return pa.RecordBatchReader.from_batches(schema, generator)
```
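For context, the `ArrowVectorIterator` behind `porter.importData(1)` presumably comes from the JDBC adapter. Below is a minimal sketch, assuming that setup, of how such an iterator could be obtained with `reuseVectorSchemaRoot` enabled and wrapped in the `ArrowVectorIteratorReader` above; `JdbcImportExample`, `importData`, and `resultSet` are illustrative names, not the actual `porter` implementation:

```java
import java.sql.ResultSet;

import org.apache.arrow.adapter.jdbc.ArrowVectorIterator;
import org.apache.arrow.adapter.jdbc.JdbcToArrow;
import org.apache.arrow.adapter.jdbc.JdbcToArrowConfig;
import org.apache.arrow.adapter.jdbc.JdbcToArrowConfigBuilder;
import org.apache.arrow.adapter.jdbc.JdbcToArrowUtils;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

public class JdbcImportExample {

  /**
   * Hypothetical stand-in for porter.importData(1): builds a JDBC iterator that
   * reuses a single VectorSchemaRoot across batches and wraps it in the
   * ArrowVectorIteratorReader shown above.
   */
  public static ArrowReader importData(BufferAllocator allocator, ResultSet resultSet)
      throws Exception {
    JdbcToArrowConfig config =
        new JdbcToArrowConfigBuilder(allocator, JdbcToArrowUtils.getUtcCalendar())
            .setTargetBatchSize(1)
            .setReuseVectorSchemaRoot(true) // the setting that triggers the corruption
            .build();
    Schema schema = JdbcToArrowUtils.jdbcToArrowSchema(resultSet.getMetaData(), config);
    ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(resultSet, config);
    return new ArrowVectorIteratorReader(allocator, iterator, schema);
  }
}
```

With `reuseVectorSchemaRoot` enabled, each call to `iterator.next()` returns the same `VectorSchemaRoot` instance with its buffers reloaded in place, which matches the observation below that the corruption only appears when the root is reused.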
For CSV, it works as expected:

```python
with wrap_from_java_stream(java_arrow_stream, allocator) as stream:
    with pa.csv.CSVWriter(csv_path, stream.schema) as writer:
        for record_batch in stream:
            writer.write_batch(record_batch)
```

For Parquet, writing with the dataset API as below:

```python
with wrap_from_java_stream(java_arrow_stream, allocator) as stream:
    pa.dataset.write_dataset(stream, data_path, format="parquet")
```

it fails with one of the following errors:

```
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
......./python3.8/site-packages/pyarrow/dataset.py:999: in write_dataset
    _filesystemdataset_write(
pyarrow/_dataset.pyx:3655: in pyarrow._dataset._filesystemdataset_write
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowInvalid: Parquet cannot store strings with size 2GB or more
```

OR

```
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Length spanned by binary offsets (7) larger than values array (size 6)
```

To figure out which record raises the error, the RecordBatchReader is re-wrapped with a smaller batch size and its content is logged, as below:

```python
with wrap_from_java_stream(java_arrow_stream, allocator) as stream:
    def generator():
        for rb in stream:
            for i in range(rb.num_rows):
                slice = rb.slice(i, 1)
                logger.info(slice.to_pylist())
                yield slice

    pa.dataset.write_dataset(
        pa.RecordBatchReader.from_batches(stream.schema, generator()),
        data_path,
        format="parquet",
    )
```

Although the logger can print every slice, `write_dataset` still fails:

```
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
......./python3.8/site-packages/pyarrow/dataset.py:999: in write_dataset
    _filesystemdataset_write(
pyarrow/_dataset.pyx:3655: in pyarrow._dataset._filesystemdataset_write
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: First or last binary offset out of bounds
```

For the Arrow/Feather format, the record batches seem to be written to the file directly, but they turn out to be invalid when read back from the file (the code is similar to the above).

However, if I create the ArrowVectorIteratorReader without `reuseVectorSchemaRoot`, everything works fine on the Python side.

### Component(s)

Java, Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org