djfrancesco opened a new issue, #47296:
URL: https://github.com/apache/arrow/issues/47296
### Describe the bug, including details regarding any error messages, version, and platform.
## Issue Summary
PyArrow appears to violate the Arrow C Data Interface specification by
accessing the `ArrowSchema` struct *after* calling its `release()` function.
The specification explicitly forbids accessing the structure after its release
callback has been executed. This leads to a use-after-free error, causing
undefined behavior or crashes.
---
## Environment
- **PyArrow Version:** 21.0.0
- **Python Version:** 3.13
- **Platform:** Linux
- **Compiler:** GCC (for the C shared library)
---
## Reproduction Steps
1. Compile the attached C code (`arrow_schema_provider.c`) into a shared
library (`libarrow_schema_provider.so`).
2. Run the provided Python script (`test_pyarrow_schema_import.py`), which
uses `ctypes` to load the library and import the schema.
3. Observe the output and the exception raised by PyArrow.
---
## Minimal C Implementation (`arrow_schema_provider.c`)
This C function exports a minimal `ArrowSchema` meant to describe a single
UTF-8 string column (it uses the format string `"s"`). The `release` callback
frees all associated memory and sets `schema->release = NULL`, as required by
the specification.
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "arrow/c/abi.h"

// Struct to manage the lifecycle of the schema's data
struct schema_ref_counted {
    int ref_count;
    char* format_copy;
    char* name_copy;
};

// The release callback for the ArrowSchema
static void my_schema_release(struct ArrowSchema* schema) {
    printf("C: my_schema_release called\n");
    if (!schema || !schema->release) {
        return;
    }
    struct schema_ref_counted* ref_schema =
        (struct schema_ref_counted*)schema->private_data;
    if (ref_schema) {
        ref_schema->ref_count--;
        printf("C: ref_count after decrement = %d\n", ref_schema->ref_count);
        if (ref_schema->ref_count <= 0) {
            printf("C: Freeing schema strings and struct\n");
            free(ref_schema->format_copy);
            free(ref_schema->name_copy);
            free(ref_schema);
        }
    }
    // Mark as released
    schema->release = NULL;
}

// Exported function to provide the schema
int get_test_schema(struct ArrowSchema* out) {
    printf("C: get_test_schema called\n");
    struct schema_ref_counted* ref_schema =
        malloc(sizeof(struct schema_ref_counted));
    if (!ref_schema) {
        return -1;  // allocation failed
    }
    ref_schema->ref_count = 1;
    // Intended as a UTF-8 string type. Note that in C Data Interface format
    // strings, "s" denotes int16 and "u" denotes utf8.
    ref_schema->format_copy = strdup("s");
    ref_schema->name_copy = strdup("test_col");
    *out = (struct ArrowSchema) {
        .format = ref_schema->format_copy,
        .name = ref_schema->name_copy,
        .metadata = NULL,
        .flags = 0,
        .n_children = 0,
        .children = NULL,
        .dictionary = NULL,
        .private_data = ref_schema,
        .release = my_schema_release
    };
    return 0;
}
```
---
## Observed Behavior & Output
When `pa.Schema._import_from_c()` is called, the output shows that PyArrow
calls `my_schema_release` **before** finishing schema import, freeing the
backing memory early. The subsequent import reads from freed memory, leading to
an error.
```python
# Excerpt from test_pyarrow_schema_import.py. `ArrowSchema` (a
# ctypes.Structure), `lib` (the loaded shared library), and the
# `ctypes` / `pyarrow as pa` imports are set up earlier in the script.

# Create an instance
schema = ArrowSchema()

# Call into the C function
res = lib.get_test_schema(ctypes.byref(schema))
if res != 0:
    raise RuntimeError("get_test_schema() failed")
print("Python: Schema acquired, attempting to import into PyArrow")

try:
    # pyarrow C Data interface expects a pointer to the schema
    schema_ptr = ctypes.pointer(schema)
    addr = ctypes.cast(schema_ptr, ctypes.c_void_p).value
    py_schema = pa.Schema._import_from_c(addr)
    print("Python: PyArrow imported schema successfully:", py_schema)
except Exception as e:
    print("Python: Exception during import:", e)
```
```
C: get_test_schema called
Python: Schema acquired, attempting to import into PyArrow
C: my_schema_release called
C: ref_count after decrement = 0
C: Freeing schema strings and struct
Python: Exception during import: Cannot import schema: ArrowSchema describes non-struct type int16
```
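The error message about `int16` may result from interpreting corrupted
memory where the format string (`"s"`) used to be. Note, however, that in the
C Data Interface `"s"` is the format string for int16 (UTF-8 is `"u"`) and a
schema import expects a struct type (`"+s"`), so the message is also
consistent with the format string having been read intact.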
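For reference when reading that error message, the sketch below lists a handful of format strings as defined in the C Data Interface specification. The `FORMAT_STRINGS` dictionary and the `describe_format` helper are hypothetical names used only for illustration, and the table is deliberately partial.

```python
# Partial map of C Data Interface format strings to Arrow type names.
# Illustrative only; the specification has the full list.
FORMAT_STRINGS = {
    "n": "null",
    "b": "bool",
    "c": "int8",
    "C": "uint8",
    "s": "int16",
    "S": "uint16",
    "i": "int32",
    "I": "uint32",
    "l": "int64",
    "L": "uint64",
    "f": "float32",
    "g": "float64",
    "u": "utf8 string",
    "U": "large utf8 string",
    "+s": "struct",
}


def describe_format(fmt: str) -> str:
    """Return the type name for a format string, if this table knows it."""
    return FORMAT_STRINGS.get(fmt, "not in this partial table")


# The three format strings relevant to this report.
for fmt in ("s", "u", "+s"):
    print(f"{fmt!r}: {describe_format(fmt)}")
```

Comparing the entries for `"s"`, `"u"`, and `"+s"` against the error text is a quick way to check whether the format string was decoded as written.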
---
## Expected Behavior (per C Data Interface Specification)
According to the Arrow C Data Interface specification:
> The release callback MUST mark the structure as released, by setting its
release member to `NULL`.
>
> Producers MUST NOT access the exported structure after `release` has been
called.
A consumer like PyArrow should fully copy all needed information from the
`ArrowSchema` struct **before** calling the `release()` callback. No fields
should be accessed after the release function is invoked.
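The copy-then-release order described above can be sketched with `ctypes` alone. This is a minimal illustration and not PyArrow's actual import code: the struct layout follows the specification, while `demo_release`, `make_demo_schema`, and `copy_then_release` are hypothetical names, and the producer here is a Python callback standing in for the C library. The `release` field is declared as a raw address so that a NULL value reads back as `None`.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """ctypes mirror of `struct ArrowSchema` from the C Data Interface."""


# Fields are assigned after the class exists because the struct is
# self-referential (children and dictionary point to ArrowSchema).
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),  # raw address; NULL reads back as None
    ("private_data", ctypes.c_void_p),
]

RELEASE_FUNC = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))


@RELEASE_FUNC
def demo_release(schema_ptr):
    # Hypothetical producer callback: invalidate the string pointers
    # (standing in for free()) and mark the struct as released.
    schema = schema_ptr.contents
    schema.format = None
    schema.name = None
    schema.release = None


def make_demo_schema():
    # Same field values as the C producer in this report.
    schema = ArrowSchema()
    schema.format = b"s"
    schema.name = b"test_col"
    schema.release = ctypes.cast(demo_release, ctypes.c_void_p).value
    return schema


def copy_then_release(schema):
    """Copy the needed fields first, then invoke the release callback."""
    if schema.release is None:
        raise ValueError("schema was already released")
    # Reading a c_char_p field yields an independent Python bytes object.
    fmt = schema.format.decode()
    name = (schema.name or b"").decode()
    RELEASE_FUNC(schema.release)(ctypes.byref(schema))
    return fmt, name


schema = make_demo_schema()
fmt, name = copy_then_release(schema)
print(fmt, name, schema.release)
```

After the call, the copied `fmt` and `name` remain valid Python strings, while the struct's own `format`, `name`, and `release` fields have been cleared by the callback and should not be read again.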
[bug_minimal_repro.zip](https://github.com/user-attachments/files/21689196/bug_minimal_repro.zip)
### Component(s)
C