timsaucer opened a new issue, #1607:
URL: https://github.com/apache/datafusion-python/issues/1607

   ## Describe the bug
   
   After the DataFusion 54 upgrade (#1562), importing `datafusion` or 
performing any Arrow-backed operation segfaults (SIGSEGV) when the installed 
PyArrow is 24.0.0 on macOS (arm64). The crash happens on the very first Arrow 
allocation made through the bindings, for example when building a literal with 
`lit(pa.scalar(0, type=pa.int32()))`. That is exactly what 
`python/datafusion/functions/spark.py` does at module import, so even a bare 
`import datafusion` crashes.
   
   This is a regression introduced on the 54 upgrade branch; it does **not** 
affect the released `datafusion-python` 53.0.0.
   
   ## Symptoms
   
   `import datafusion` (or any operation that constructs an Arrow value) 
terminates the process with `Segmentation fault: 11` (exit code 139). The 
native crash report points into PyArrow's own bundled mimalloc, not into our 
code:
   
   ```
   mi_theap_malloc_zero_aligned_at_overalloc        <- SIGSEGV (mimalloc v3 thread-heap)
   mi_theap_realloc_zero_aligned_at
   arrow::MimallocAllocator::ReallocateAligned
   arrow::PoolBuffer::Resize
   arrow::NumericBuilder<Int32Type>::FinishInternal
   arrow::py::ConvertPySequence
   __pyx_pw_7pyarrow_3lib_191scalar                 <- pa.scalar(0, type=int32())
   ```
   
   ## Root cause
   
   There are two independent mimalloc runtimes in the process:
   
   - `datafusion-python` installs mimalloc as the Rust `#[global_allocator]` 
(`crates/core/src/lib.rs`, enabled by the default `mimalloc` feature).
   - PyArrow 24 ships and defaults to its own bundled mimalloc memory pool.
   
   The DataFusion 54 dependency bump moved `libmimalloc-sys` 0.1.44 -> 0.1.49 
(the `mimalloc` crate 0.1.48 -> 0.1.52), which changed the bundled allocator 
from mimalloc **v2** to mimalloc **v3**. PyArrow 24 also bundles mimalloc 
**v3**. Two mimalloc-v3 runtimes collide at the macOS process-global level 
(malloc-zone / thread-local-heap initialization), corrupting each other's 
thread heap and faulting on the first allocation.
   
   The 53.0.0 release shipped mimalloc **v2** (`libmimalloc-sys` 0.1.44), which 
coexists fine with PyArrow's v3 pool. That is why no released version is 
affected.
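
To confirm which allocator PyArrow is using on a given machine, a minimal sketch along these lines may help. It relies on PyArrow's `default_memory_pool().backend_name`; the helper name `arrow_pool_backend` is made up for illustration. It is best run in a process that has not imported `datafusion`, so that only PyArrow's allocator is involved.

```python
import importlib.util


def arrow_pool_backend():
    """Return the name of PyArrow's default memory pool backend, or None.

    None means PyArrow is not installed in this environment.
    (Hypothetical helper, not part of datafusion-python or PyArrow.)
    """
    if importlib.util.find_spec("pyarrow") is None:
        return None
    import pyarrow as pa

    # backend_name is a short string such as "mimalloc", "jemalloc" or "system".
    return pa.default_memory_pool().backend_name


if __name__ == "__main__":
    print("PyArrow default pool backend:", arrow_pool_backend())
```

If this reports `mimalloc` on macOS, the process ends up with two mimalloc runtimes as soon as a 54-branch `datafusion` is loaded.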
   
   ## Affected versions / platforms
   
   - **PyArrow**: 24.0.0 triggers it. PyArrow 20.0.0 through 23.0.1 are 
unaffected (verified against the 54-branch build).
   - **datafusion-python**: the in-progress 54 upgrade branch. Released 53.0.0 
is **not** affected (verified with PyArrow 20–24).
   - **Platforms**: confirmed on macOS arm64. Linux is expected to be 
unaffected because PyArrow defaults to jemalloc there (only one mimalloc in the 
process). Windows defaults to mimalloc like macOS, so it is potentially 
affected, but the macOS-specific malloc-zone vector may not apply there; this 
needs verification in CI.
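
When collecting reports against the matrix above, it can be useful to read the installed versions without importing either package, since the import itself is what crashes. The following is a sketch using the standard library's `importlib.metadata`; the helper names are hypothetical, and it assumes the distributions are installed under the names `datafusion` and `pyarrow`.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None."""
    try:
        # Reads package metadata only; the package itself is not imported.
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def report_versions(distributions=("datafusion", "pyarrow")):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in distributions}


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```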
   
   ## Reproduction
   
   On macOS arm64 with a 54-branch build of `datafusion-python` and 
`pyarrow==24.0.0`:
   
   ```python
   import datafusion  # segfaults here (spark.py builds an int32 literal at import)
   ```
   
   or, isolating the allocation:
   
   ```python
   import pyarrow as pa
   from datafusion import lit
   lit(pa.scalar(0, type=pa.int32()))  # SIGSEGV
   ```
   
   ## Suggested fix
   
   Pin the bundled allocator to the mimalloc v2 line so two mimalloc-v3 
runtimes never coexist. `libmimalloc-sys` (and the `mimalloc` crate) expose a 
`v2` feature for this; adding it to the `mimalloc` feature list in 
`crates/core/Cargo.toml` keeps the Rust global allocator (no performance loss, 
no PyArrow pin) and resolves the crash. This has been verified locally: with 
the `v2` feature the 54-branch build runs cleanly against PyArrow 24.0.0.
   
   A longer-term fix should investigate making two mimalloc-v3 instances 
coexist (or platform-gating the allocator), and we should add a CI smoke test 
that imports `datafusion` and constructs an Arrow literal against the newest 
PyArrow on macOS so this regression cannot return silently.
   
   ## Acceptance / testing
   
   The fix must include test coverage: a smoke test (run on macOS, and ideally 
Windows) that imports `datafusion` and builds an Arrow-backed literal under the 
newest supported PyArrow, asserting no crash.
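
One possible shape for such a smoke test is sketched below. Because a segfault kills the interpreter, the check runs the reproduction snippet in a child process and inspects its return code, so the test runner itself survives and can report the failure. The names `run_isolated`, `describe_exit`, and `smoke_check` are hypothetical, and how this would be wired into the project's test suite is left open.

```python
import signal
import subprocess
import sys

# Snippet taken from the reproduction in this issue.
SMOKE_SNIPPET = (
    "import pyarrow as pa\n"
    "from datafusion import lit\n"
    "lit(pa.scalar(0, type=pa.int32()))\n"
)


def run_isolated(code, timeout=120):
    """Run `code` in a fresh interpreter and return the CompletedProcess."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe_exit(returncode):
    """Turn a subprocess return code into a readable string."""
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with code {returncode}"


def smoke_check():
    """Fail with a readable message if the import or literal crashes."""
    result = run_isolated(SMOKE_SNIPPET)
    assert result.returncode == 0, (
        f"child {describe_exit(result.returncode)}\n{result.stderr}"
    )
```

Note that Python's `subprocess` reports a SIGSEGV death as return code `-11` on POSIX, whereas a shell shows the same crash as exit code 139 (128 + 11).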
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

