lriggs opened a new issue, #50186:
URL: https://github.com/apache/arrow/issues/50186
### Describe the bug, including details regarding any error messages, version, and platform.
Gandiva's REPLACE(text, from, to) fails with "Buffer overflow for output
string" whenever the result exceeds 65535 bytes. The function hardcodes a
65535-byte output buffer cap, even though Gandiva's variable-length output
column grows dynamically and is bounded only by the int32 offset width
(~2 GB).
### Root cause
In cpp/src/gandiva/precompiled/string_ops.cc, the SQL-facing
replace_utf8_utf8_utf8 delegates to replace_with_max_len_utf8_utf8_utf8 with a
hardcoded max_length = 65535. That implementation allocates an arena buffer of
exactly max_length bytes and raises the error as soon as the running output
index would exceed it. The cap is arbitrary — nothing downstream requires it.
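The failure mode described above can be illustrated with a small standalone sketch. This is not Gandiva's code: `replace_capped` is a hypothetical stand-in that uses `std::string` instead of the arena allocator and returns `std::nullopt` where the real function would report "Buffer overflow for output string". It only assumes the behaviour stated in this report, namely a buffer sized to `max_length` and an error as soon as the running output would exceed it.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for the capped behaviour described above; NOT Gandiva's
// code. Returns std::nullopt where the real function would raise the error.
std::optional<std::string> replace_capped(std::string_view text,
                                          std::string_view from,
                                          std::string_view to,
                                          std::size_t max_length) {
  if (from.empty()) return std::string(text);  // nothing to match
  std::string out;
  out.reserve(max_length);  // buffer sized up front to the cap
  std::size_t pos = 0;
  while (pos < text.size()) {
    // Does `from` occur at the current position?
    const bool match = text.compare(pos, from.size(), from) == 0;
    const std::string_view piece = match ? to : text.substr(pos, 1);
    // Fail as soon as the running output would exceed the cap.
    if (out.size() + piece.size() > max_length) return std::nullopt;
    out.append(piece);
    pos += match ? from.size() : 1;
  }
  return out;
}
```

With `max_length = 65535`, a 30000-byte input of 'X' replaced by "XY" fits (60000 bytes), while the 35000-byte input from this report does not.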
### To Reproduce
Any REPLACE whose result exceeds 65535 bytes reproduces it. Minimal C++:
// ctx: execution-context handle, assumed to be set up by the caller
gdv_int32 out_len = 0;
std::string in(35000, 'X'); // 35 KB input
replace_utf8_utf8_utf8(ctx, in.data(), 35000, "X", 1, "XY", 2, &out_len);
// -> error: "Buffer overflow for output string" (result would be 70 KB)
SQL repro (Dremio):
CREATE TABLE IF NOT EXISTS $scratch.gandiva_repro_seed AS
SELECT '<Document>line2' AS clrmsgenvlp_msg,
repeat('X', 35000) AS part_x, repeat('Y', 35000) AS part_y
FROM (VALUES (1)) AS v(x);
SELECT REPLACE(CONCAT(clrmsgenvlp_msg, part_x, part_y), 'X', 'XY')
FROM $scratch.gandiva_repro_seed;
CONCAT / CONCAT_WS / LISTAGG are not at fault — the failure is the REPLACE
applied on top of their large output.
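For reference, the output sizes of both repros can be worked out at compile time. The helper `replaced_size` and the constant names are hypothetical; the figures assume the 15-byte `'<Document>line2'` prefix contains no 'X', so only the 35000-byte 'X' run matches.

```cpp
#include <cstdint>

// Hypothetical helper: bytes produced when `matches` non-overlapping
// occurrences of a `from_len`-byte pattern are replaced by `to_len` bytes.
constexpr std::int64_t replaced_size(std::int64_t input_len, std::int64_t matches,
                                     std::int64_t from_len, std::int64_t to_len) {
  return input_len + matches * (to_len - from_len);
}

constexpr std::int64_t kCap = 65535;  // the hardcoded limit from this report

// C++ repro: 35000 'X' bytes, each "X" becomes "XY".
constexpr std::int64_t kCppRepro = replaced_size(35000, 35000, 1, 2);

// SQL repro: 15-byte prefix + 35000 'X' + 35000 'Y'; only the 'X' run matches.
constexpr std::int64_t kSqlInput = 15 + 35000 + 35000;
constexpr std::int64_t kSqlRepro = replaced_size(kSqlInput, 35000, 1, 2);

static_assert(kCppRepro == 70000 && kCppRepro > kCap, "C++ repro exceeds the cap");
static_assert(kSqlRepro == 105015 && kSqlRepro > kCap, "SQL repro exceeds the cap");
```

In the SQL case the concatenated input (70015 bytes) is already larger than the cap before any replacement happens.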
### Expected behavior
REPLACE should return the full result regardless of size (up to Gandiva's
normal variable-length limits), not fail at an arbitrary 64 KB threshold.
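One possible way to get this behaviour, shown only as a sketch, is to compute the exact output size first and allocate once. `replace_exact` is a hypothetical name, the code uses `std::string` rather than Gandiva's arena, and it is not necessarily how the actual fix is or will be implemented. The only bound it keeps is the int32 offset width mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical sketch, not the Gandiva fix: size the output exactly, then fill.
std::optional<std::string> replace_exact(std::string_view text,
                                         std::string_view from,
                                         std::string_view to) {
  if (from.empty()) return std::string(text);  // nothing to match

  // Pass 1: count non-overlapping matches, scanning left to right.
  std::int64_t matches = 0;
  for (std::size_t p = text.find(from); p != std::string_view::npos;
       p = text.find(from, p + from.size())) {
    ++matches;
  }
  const std::int64_t out_len =
      static_cast<std::int64_t>(text.size()) +
      matches * (static_cast<std::int64_t>(to.size()) -
                 static_cast<std::int64_t>(from.size()));

  // The only remaining bound is the int32 offset width.
  if (out_len > std::numeric_limits<std::int32_t>::max()) return std::nullopt;

  // Pass 2: allocate once, then copy unmatched runs and replacements.
  std::string out;
  out.reserve(static_cast<std::size_t>(out_len));
  std::size_t pos = 0;
  for (std::size_t p = text.find(from); p != std::string_view::npos;
       p = text.find(from, pos)) {
    out.append(text.substr(pos, p - pos));  // text before the match
    out.append(to);                         // the replacement
    pos = p + from.size();
  }
  out.append(text.substr(pos));  // trailing text after the last match
  return out;
}
```

The trade-off is a second scan of the input in exchange for a single allocation of exactly the right size; growing the buffer on demand would be another option.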
### Component(s)
C++, Gandiva