kosiew opened a new issue, #22688:
URL: https://github.com/apache/datafusion/issues/22688
## Summary
DataFusion currently has mixed safety behavior for variable-size string
output construction.
`repeat` now performs explicit checked accounting before writing output
buffers, but many other string UDF paths still rely on ad hoc capacity
estimates and panic-prone builder internals (`expect("byte array offset
overflow")`).
This issue proposes introducing shared, fallible helpers (or a fallible
builder API) that return `DataFusionError` instead of panicking when cumulative
byte offsets overflow.
## Motivation
We should enforce a consistent no-panic contract for runtime string UDF
execution, especially on extreme or adversarial inputs.
Today:
- `repeat` contains local checked logic (`repeat_len`,
`calculate_capacities`) to prevent overflows before array construction.
- `GenericStringArrayBuilder` still has panic paths for cumulative offset
conversion.
- Several string/unicode UDF call sites still depend on heuristic or ad hoc
capacity math and then append through panic-capable builder methods.
This inconsistency makes correctness auditing harder and increases risk that
a code path regresses to panic behavior.
## Current Evidence
Representative locations:
1. Checked, local approach in `repeat`
- `datafusion/functions/src/string/repeat.rs`
- Uses checked multiplication/addition and explicit max-size validation
before builder use.
2. Panic-capable builder internals
- `datafusion/functions/src/strings.rs`
- `GenericStringArrayBuilder` methods convert cumulative byte length to
offsets via `expect("byte array offset overflow")`.
3. UDF call sites that currently rely on builder behavior + ad hoc pre-sizing
- `datafusion/functions/src/string/replace.rs`
- `datafusion/functions/src/string/common.rs`
- `datafusion/functions/src/unicode/initcap.rs`
- `datafusion/functions/src/unicode/substrindex.rs`
## Problem Statement
Variable-size string output assembly lacks a single, reusable, fallible
accounting path for:
- Per-row output byte length validation
- Total output byte accumulation validation
- Offset type bound validation (`i32` for `Utf8`, `i64` for `LargeUtf8`)
As a result, failure behavior differs by function:
- Some functions return structured execution errors.
- Others can panic from builder internals when offsets exceed type bounds.
## Desired Outcome
All variable-size string-producing UDF code paths should fail with
`DataFusionError` (not panic) when output byte-size or offset limits are
exceeded.
## Proposed Approach
Implement a shared fallible mechanism and migrate call sites incrementally.
### Option A (preferred): Fallible builder extension
Add fallible append APIs to `GenericStringArrayBuilder` (and related builder
wrappers):
- `try_append_value(&mut self, value: &str) -> Result<()>`
- `try_append_placeholder(&mut self) -> Result<()>`
- `try_append_with(...) -> Result<()>`
Internally replace `expect(...)` on offset conversion with explicit checked
conversion returning `exec_err!` (or equivalent `DataFusionError`).
Pros:
- Strongest safety boundary at the lowest level.
- Prevents accidental panic regressions in new UDF code.
### Option B: Shared preflight accounting helper
Provide reusable helpers for per-row and total-capacity checked accounting
and keep existing append API.
Pros:
- Smaller surface change.
Cons:
- Safety depends on every caller using preflight correctly; easier to
regress.
### Recommendation
Do both in sequence:
1. Introduce shared checked accounting helper(s).
2. Add fallible builder APIs and migrate high-risk call sites.
3. Keep existing panic APIs temporarily for compatibility; deprecate
internally.
## Scope
In scope:
- `datafusion/functions/src/strings.rs` builder APIs and internals
- String UDF call sites that build variable-size outputs in
`datafusion/functions/src/string/*` and `datafusion/functions/src/unicode/*`
- Regression tests for overflow/no-panic behavior
Out of scope:
- Non-string builders
- Broad refactors unrelated to offset/capacity overflow behavior
## Implementation Plan
1. Add shared checked helpers for:
- `checked_repeat_len` style multiplication
- checked cumulative capacity addition
- checked conversion to target offset type
2. Introduce fallible builder append methods:
- Return `Result<()>` on offset overflow
- Preserve current performance characteristics where possible
3. Migrate selected UDFs first:
- `replace`
- case conversion path in `string/common`
- `initcap`
- `substrindex`
- other variable-size emitters discovered during audit
4. Add targeted overflow tests per migrated UDF path.
5. Optionally deprecate direct panic append methods for internal usage and
enforce fallible variants in new code.
## Acceptance Criteria
- No panic on cumulative offset overflow in migrated string UDF paths.
- Overflow conditions return stable `DataFusionError` messages.
- Existing behavior/performance remains unchanged for normal-sized inputs.
- New tests cover:
- per-item length overflow
- cumulative output overflow
- null-handling with overflow-adjacent inputs
- CI passes for affected crates and relevant SQLLogicTests.
## Testing Strategy
Minimum validation:
- Unit tests in `datafusion-functions` for each migrated UDF with
overflow-triggering inputs.
- Keep/extend existing `repeat` overflow regression test coverage.
- Run targeted suites:
- `cargo test -p datafusion-functions`
- `cargo test -p datafusion-sqllogictest --test sqllogictests
string_literal`
## Related PR
#22293
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]