brunal opened a new issue, #46407:
URL: https://github.com/apache/arrow/issues/46407

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Steps to reproduce:
   * Create a ListArray in Rust
   * Slice it (at index > 0)
   * Send it to C++ via the C data interface
   * Perform IPC serialization of the array (wrapped in a RecordBatch)
   * The resulting message produces invalid data upon deserialization, in C++ or Rust, because its offsets buffer points past the end of its child data.
   
   Here is a standalone python reproduction:
   ```python
   import pyarrow as pa
   
   # This ListArray represents [[3, 4, 5]]. It was sliced the way Rust slices
   # ListArrays.
   # The C++ slicing would have resulted in offsets_buffer = [0, 2, 5] and
   # top-level offset = 1.
   list_array = pa.ListArray.from_arrays(offsets=pa.array([2, 5]), values=[1, 2, 3, 4, 5])
   list_array.validate()
   assert list_array == pa.array([[3, 4, 5]])
   
   table = pa.table({"col": list_array})
   sink = pa.BufferOutputStream()
   pa.ipc.new_stream(sink, table.schema).write_table(table)
   
   reader = pa.ipc.RecordBatchStreamReader(sink.getvalue())
   table_deserialized = pa.Table.from_batches(list(reader))
   
   # This raises pyarrow.lib.ArrowInvalid: In chunk 0: Invalid: First or last
   # list offset out of bounds
   table_deserialized.column(0).validate()
   ``` 
   
   The gist of the issue is that:
   * Rust and C++ slice ListArray differently
   * C++ bumps the top-level offset of the ArrayData
   * Rust, however, does not maintain a top-level offset. Instead, it slices the offsets buffer
   * Upon IPC serialization of a ListArray, C++ only looks at the top-level offset to decide whether to rebuild the offsets buffer, so offsets that do not start at zero are written unchanged. The child data, however, is properly rebuilt
   * This leads to a corrupt serialized message
   
   I have a test+fix for this.
   
   ### Component(s)
   
   C++


