[I] PyArrow appears to access ArrowSchema after its release() callback is called [arrow]

via GitHub Fri, 08 Aug 2025 10:51:46 -0700


djfrancesco opened a new issue, #47296:
URL: https://github.com/apache/arrow/issues/47296


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ## Issue Summary
   
   PyArrow appears to violate the Arrow C Data Interface specification by 
accessing the `ArrowSchema` struct *after* calling its `release()` function. 
The specification explicitly forbids accessing the structure after its release 
callback has been executed. This leads to a use-after-free error, causing 
undefined behavior or crashes.
   
   ---
   
   ## Environment
   
   - **PyArrow Version:** 21.0.0  
   - **Python Version:** 3.13  
   - **Platform:** Linux  
   - **Compiler:** GCC (for the C shared library)  
   
   ---
   
   ## Reproduction Steps
   
   1. Compile the attached C code (`arrow_schema_provider.c`) into a shared 
library (`libarrow_schema_provider.so`).  
   2. Run the provided Python script (`test_pyarrow_schema_import.py`), which 
uses `ctypes` to load the library and import the schema.  
   3. Observe the output and the exception raised by PyArrow.  
   
   ---
   
   ## Minimal C Implementation (`arrow_schema_provider.c`)
   
   This C function exports a minimal `ArrowSchema` describing a single UTF-8 
string column. The `release` callback frees all associated memory and sets 
`schema->release = NULL`, as required by the specification.
   
   ```c
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include "arrow/c/abi.h"
   
   // Struct to manage the lifecycle of the schema's data
   struct schema_ref_counted {
       int ref_count;
       char* format_copy;
       char* name_copy;
   };
   
   // The release callback for the ArrowSchema
   static void my_schema_release(struct ArrowSchema* schema) {
       printf("C: my_schema_release called\n");
       if (!schema || !schema->release) {
           return;
       }
   
       struct schema_ref_counted* ref_schema =
           (struct schema_ref_counted*)schema->private_data;
       if (ref_schema) {
           ref_schema->ref_count--;
           printf("C: ref_count after decrement = %d\n", ref_schema->ref_count);
           if (ref_schema->ref_count <= 0) {
               printf("C: Freeing schema strings and struct\n");
               free(ref_schema->format_copy);
               free(ref_schema->name_copy);
               free(ref_schema);
           }
       }
   
       // Mark as released
       schema->release = NULL;
   }
   
   // Exported function to provide the schema
   int get_test_schema(struct ArrowSchema* out) {
       printf("C: get_test_schema called\n");
   
       struct schema_ref_counted* ref_schema =
           malloc(sizeof(struct schema_ref_counted));
       ref_schema->ref_count = 1;
       ref_schema->format_copy = strdup("s"); // UTF-8 string type
       ref_schema->name_copy = strdup("test_col");
   
       *out = (struct ArrowSchema) {
           .format = ref_schema->format_copy,
           .name = ref_schema->name_copy,
           .metadata = NULL,
           .flags = 0,
           .n_children = 0,
           .children = NULL,
           .dictionary = NULL,
           .private_data = ref_schema,
           .release = my_schema_release
       };
       return 0;
   }
   ```
   
   ---
   
   ## Observed Behavior & Output
   
   When `pa.Schema._import_from_c()` is called, the output shows that PyArrow 
calls `my_schema_release` **before** finishing schema import, freeing the 
backing memory early. The subsequent import reads from freed memory, leading to 
an error.
   
   ```python
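   import ctypes
   import pyarrow as pa

   # Assumed setup (the full script is in the attached zip): a ctypes mirror
   # of struct ArrowSchema and the shared library built from the C code above.
   class ArrowSchema(ctypes.Structure):
       _fields_ = [
           ("format", ctypes.c_char_p), ("name", ctypes.c_char_p),
           ("metadata", ctypes.c_char_p), ("flags", ctypes.c_int64),
           ("n_children", ctypes.c_int64), ("children", ctypes.c_void_p),
           ("dictionary", ctypes.c_void_p), ("release", ctypes.c_void_p),
           ("private_data", ctypes.c_void_p),
       ]

   lib = ctypes.CDLL("./libarrow_schema_provider.so")  # path is an assumption
   lib.get_test_schema.argtypes = [ctypes.POINTER(ArrowSchema)]
   lib.get_test_schema.restype = ctypes.c_int

   # Create an instance
   schema = ArrowSchema()

   # Call into the C function
   res = lib.get_test_schema(ctypes.byref(schema))
   if res != 0:
       raise RuntimeError("get_test_schema() failed")

   print("Python: Schema acquired, attempting to import into PyArrow")

   try:
       # _import_from_c takes the address of the ArrowSchema struct as an integer
       schema_ptr = ctypes.pointer(schema)
       addr = ctypes.cast(schema_ptr, ctypes.c_void_p).value
       py_schema = pa.Schema._import_from_c(addr)
       print("Python: PyArrow imported schema successfully:", py_schema)
   except Exception as e:
       print("Python: Exception during import:", e)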
   ```
   
   ```
   C: get_test_schema called
   Python: Schema acquired, attempting to import into PyArrow
   C: my_schema_release called
   C: ref_count after decrement = 0
   C: Freeing schema strings and struct
   Python: Exception during import: Cannot import schema: ArrowSchema describes non-struct type int16
   ```
   
   The error message about `int16` likely results from interpreting corrupted 
memory where the format string (`"s"`) used to be.
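
Before attributing the `int16` in the message to corrupted memory, it may be worth ruling out a simpler reading: in the format-string table of the C Data Interface specification, `"s"` denotes int16 and `"u"` denotes a UTF-8 string, and a schema-level import expects a top-level struct (`"+s"`). The sketch below is a partial lookup of those format strings; `describe_format` is a hypothetical helper, not a PyArrow API.

```python
# Partial table of C Data Interface format strings (primitive types,
# binary/string types, and the struct marker). Not exhaustive.
FORMAT_STRINGS = {
    "n": "null", "b": "bool",
    "c": "int8", "C": "uint8",
    "s": "int16", "S": "uint16",
    "i": "int32", "I": "uint32",
    "l": "int64", "L": "uint64",
    "e": "float16", "f": "float32", "g": "float64",
    "z": "binary", "Z": "large_binary",
    "u": "utf8", "U": "large_utf8",
    "+s": "struct",
}


def describe_format(fmt: str) -> str:
    """Return the type name for a format string, if this sketch knows it."""
    return FORMAT_STRINGS.get(fmt, "not covered by this sketch")


if __name__ == "__main__":
    for fmt in ("s", "u", "+s"):
        print(repr(fmt), "->", describe_format(fmt))
```

If that reading applies, the message would match the format string exactly as exported, with no memory corruption involved. Whether the early `release` call is then simply the importer releasing a struct it took ownership of on a failed import is an assumption that would need checking against PyArrow's source.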
   
   ---
   
   ## Expected Behavior (per C Data Interface Specification)
   
   According to the Arrow C Data Interface specification:
   
   > The release callback MUST mark the structure as released, by setting its 
release member to `NULL`.
   >
   > Producers MUST NOT access the exported structure after `release` has been 
called.
   
   A consumer like PyArrow should fully copy all needed information from the 
`ArrowSchema` struct **before** calling the `release()` callback. No fields 
should be accessed after the release function is invoked.
   
   
[bug_minimal_repro.zip](https://github.com/user-attachments/files/21689196/bug_minimal_repro.zip)
   
   ### Component(s)
   
   C

