Mryange opened a new pull request, #64796:
URL: https://github.com/apache/doris/pull/64796
### What problem does this PR solve?
Doris converts Arrow arrays into Doris columns through
`DataTypeSerDe::read_column_from_arrow`. If the Arrow producer sends malformed
array metadata, such as truncated validity bitmaps, truncated offsets buffers,
non-monotonic string/list offsets, or offsets pointing past the child/value
buffer, the existing conversion code may read invalid Arrow memory and crash BE.
Root cause: the Arrow-to-Doris serde path trusted Arrow array metadata
before accessing Arrow buffers. Several hot paths call `IsNull()`, `Value()`,
raw value offsets, list offsets, or child arrays directly, so malformed Arrow
buffers can trigger out-of-bounds reads before Doris reports a clean error.
This PR adds lightweight, type-specific Arrow input validation before those
buffer accesses. The checks are modeled as local preflight checks rather than
full `ValidateFull()`: validity bitmap size, fixed-width data buffer size,
boolean bitmap size, binary/string offsets buffer size, per-value data range,
and list/map offsets monotonicity plus child length bounds. A BE config
`enable_arrow_input_validation` is added and defaults to `true`.
The change also fixes an existing `FixedSizeBinaryArray` sliced-read null
check: the loop uses a relative index after `GetValue(start)`, but `IsNull()`
expects the original Arrow row index, so it must check `start + offset_i`.
| Type | Rows | Check disabled | Check enabled | Overhead |
| --- | ---: | ---: | ---: | ---: |
| String | 4096 | 39,352 ns | 42,032 ns | +6.8% |
| String | 65536 | 604,782 ns | 624,064 ns | +3.2% |
| Int64 | 4096 | 1,480 ns | 1,523 ns | +2.9% |
| Int64 | 65536 | 16,160 ns | 15,883 ns | -1.7% |
| Boolean | 4096 | 4,784 ns | 4,888 ns | +2.2% |
| Boolean | 65536 | 79,017 ns | 80,453 ns | +1.8% |
| ArrayString | 4096 | 139,747 ns | 147,793 ns | +5.8% |
| ArrayString | 65536 | 2,384,377 ns | 2,477,601 ns | +3.9% |
| MapStringInt | 4096 | 84,375 ns | 96,022 ns | +13.8% |
| MapStringInt | 65536 | 2,742,538 ns | 2,903,358 ns | +5.9% |
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should
merge into -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]