steven-winfield-quantohm opened a new issue, #3508:
URL: https://github.com/apache/iceberg-python/issues/3508
### Apache Iceberg version
0.11.0 (latest release)
### Please describe the bug 🐞
When upserting into an Iceberg table, PyIceberg first scans the target table
to
find which existing rows match the source rows' key columns. It builds that
"matching" predicate in ``pyiceberg.table.upsert_util.create_match_filter``:
* For a **single** join column it emits one flat ``In(col, [v1, v2, ...])``.
PyArrow lowers this to a single ``is_in`` compute node, no matter how many
values it contains — so single-column upserts of huge tables are fine.
* For a **multi-column** key it instead emits one disjunct per distinct key
tuple::
Or(And(c1 == v1, c2 == w1),
And(c1 == v2, c2 == w2),
...) # ONE disjunct PER ROW
PyIceberg builds that ``Or`` as a balanced tree, so the *Python* side copes.
But when the expression is handed to PyArrow's dataset scanner as a filter,
the
C++ expression engine canonicalises it: ``Dataset::GetFragments`` calls
``SimplifyWithGuarantee`` → ``Canonicalize``, which flattens the associative
``or_kleene`` chain and then **recurses** over it. With tens of thousands of
disjuncts that recursion overflows the C++ call stack and the **process
segfaults** (SIGSEGV) — typically after several minutes of work, with a
backtrace full of ``arrow::compute::Canonicalize`` / ``ModifyExpression``
frames.
Reference: https://github.com/apache/iceberg-python/issues/3272
Note that apache/iceberg-python#3448 addresses a *different* upsert segfault
(a
per-batch Acero re-filter in ``_task_to_record_batches``, mostly observed on
Apple Silicon). It does not touch the ``GetFragments`` canonicalisation path
exercised here, so it does not help with this crash.
The fix
-------
Produce a predicate that matches exactly the same rows, but with far fewer
disjuncts. Group the key tuples and emit a single ``In`` over whichever
column
collapses to the fewest distinct "prefix" combinations (choosing that column
makes the result independent of the caller's column ordering)::
Or(And(c1 == v1, c2 IN [w, x, y]),
And(c1 == v2, c2 IN [z]),
...) # one disjunct per distinct PREFIX
The disjunct count drops from "number of rows" to "number of distinct prefix
values". In the synthetic data below there are 50 000 unique ids spread over
just 50 group values, so the predicate shrinks from 50 000 disjuncts to 50 —
shallow enough that PyArrow's canonicaliser no longer overflows.
Caveat
------
This helps whenever at least one key column is low-cardinality (or,
equivalently,
one column is near-unique and can be folded into the ``In``). A genuinely
high-cardinality *composite* key — where every column is near-unique and all
of
them are needed to identify a row — still produces roughly one disjunct per
row
even after grouping, and can still overflow. For that pathological case the
only robust option is to upsert in smaller batches.
[pyiceberg-stacktrace.txt](https://github.com/user-attachments/files/28956590/pyiceberg-stacktrace.txt)
[iceberg_upsert_segfault_repro.py](https://github.com/user-attachments/files/28956597/iceberg_upsert_segfault_repro.py)
### Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]