morningman opened a new pull request, #64797:
URL: https://github.com/apache/doris/pull/64797
### What problem does this PR solve?
Issue Number: ref #62259 (partial — robustness layer only, does not close
the issue)
Related PR: N/A
Problem Summary:
When an Arrow Flight SQL query scans an external table (e.g. Iceberg) in
batch split mode, the BE fetches file splits from the FE via the
`fetchSplitBatch` thrift RPC *during* the scan. If that fetch fails — most
notably when the split source has already been released — the error path could
crash the BE (SIGSEGV in
`arrow::flight::internal::TransportStatus::FromStatus`) instead of failing the
query gracefully (see #62259).
This PR is the **robustness layer** for that issue: it ensures any
`fetchSplitBatch` failure makes the query fail gracefully rather than crashing
the BE. It does **not** fix the underlying split source lifecycle problem (the
source being released after `GetFlightInfo` but before `DoGet` on the Arrow
Flight two-phase path), which is tracked separately in the issue.
Changes:
1. **BE `split_source_connector`** — guard `result.status.error_msgs[0]`
with an `empty()` check to avoid an out-of-bounds vector read when the FE
returns a non-OK status without an error message.
2. **BE `to_arrow_status`** — truncate the error message handed to
Arrow/gRPC to a length well below 8192. The message is carried in the gRPC
trailer (an HTTP2 header) and may be percent-encoded, so an oversized one can
break the response or crash the flight transport status conversion. The 8192
limit was already documented in a comment in this function but was never
enforced. The full message is still logged on the BE.
3. **FE `fetchSplitBatch`** — when the split source has been released,
return a structured `TStatus(NOT_FOUND)` with a message instead of throwing a
bare `TException`. The BE then receives a well-formed, non-empty error through
the normal result path instead of a thrift transport exception.
### Release note
Fix a BE crash (SIGSEGV) that could happen on the error path of Arrow Flight
SQL queries against external tables when fetching split batches from the FE
fails.
### Check List (For Author)
- Test
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [x] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [x] Other reason: This is defensive hardening of an error path (an
out-of-bounds guard, a message-length cap, and returning a structured error
status instead of throwing). It only triggers when `fetchSplitBatch` fails;
existing tests cover the success path, and the crash depends on Arrow/gRPC
transport internals that are hard to reproduce deterministically in CI.
- Behavior changed:
- [x] Yes. On `fetchSplitBatch` failure the query now fails with a clear
error message instead of (potentially) crashing the BE, and the FE no longer
throws a bare `TException` for a released split source (it returns a
`NOT_FOUND` status).
- Does this need documentation?
- [x] No.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]