liaoxin01 opened a new pull request, #60920: URL: https://github.com/apache/doris/pull/60920
## Proposed changes

Optimize stream load CSV read performance for nullable string columns by eliminating per-row overhead from the SerDe abstraction layer.

### Changes

1. **Cache nullable string column pointers per-batch**: Pre-compute `assert_cast` results (ColumnStr and NullMap pointers) once per batch instead of once per row per column, stored in `NullableStringColumnCache`.
2. **Inline nullable string write path**: Bypass `_deserialize_nullable_string` and `StringSerDe::deserialize_one_cell_from_csv` in the hot loop, directly performing null checks, escape handling, and `insert_data`/`push_back`.
3. **Pre-reserve column capacity**: Reserve `offsets`, `chars`, and `null_map` capacity at batch start to reduce PODArray realloc overhead during the row loop.

### Performance

Tested with ClickBench dataset stream load:

- Import time reduced from 571s to 476s (**16.6% improvement**)
- Compared to 2.1.7 baseline (650s), now **26.8% faster**

### Flame graph analysis

Before optimization, the `_deserialize_nullable_string` path dominated with +96s self-time from:

- Per-row `assert_cast<ColumnNullable&>` (+65s)
- `StringSerDe::deserialize_one_cell_from_csv` intermediate layer (+54s)
- Repeated PODArray reserve/realloc during column growth

After optimization, these costs are eliminated or amortized to per-batch.
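The three changes listed under Changes can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `ToyNullableStringColumn`, `ToyColumnCache`, `prepare_batch`, `write_cell`, and `value_at` are hypothetical names, `std::vector` stands in for PODArray, and escape handling is omitted. The sketch only shows the general shape of resolving column pointers and reserving capacity once per batch, then writing cells through the cached pointers in the row loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a nullable string column (not Doris's real types).
struct ToyNullableStringColumn {
    std::vector<char> chars;        // concatenated bytes of all values
    std::vector<uint32_t> offsets;  // end offset of each row within chars
    std::vector<uint8_t> null_map;  // 1 = NULL, 0 = value present
};

// Hypothetical per-batch cache: pointers are resolved once, then reused per row.
struct ToyColumnCache {
    std::vector<char>* chars;
    std::vector<uint32_t>* offsets;
    std::vector<uint8_t>* null_map;
};

// Called once at batch start: reserve capacity and capture the pointers.
// avg_len is an assumed estimate of the average field length in bytes.
ToyColumnCache prepare_batch(ToyNullableStringColumn& col, size_t rows, size_t avg_len) {
    col.offsets.reserve(col.offsets.size() + rows);
    col.null_map.reserve(col.null_map.size() + rows);
    col.chars.reserve(col.chars.size() + rows * avg_len);
    return {&col.chars, &col.offsets, &col.null_map};
}

// Called once per cell in the hot loop: null check and append, with no
// per-row cast or intermediate dispatch layer.
inline void write_cell(const ToyColumnCache& c, std::string_view field,
                       std::string_view null_token) {
    if (field == null_token) {
        // NULL row: no bytes appended, offset repeats the current end.
        c.null_map->push_back(1);
        c.offsets->push_back(static_cast<uint32_t>(c.chars->size()));
        return;
    }
    c.chars->insert(c.chars->end(), field.begin(), field.end());
    c.offsets->push_back(static_cast<uint32_t>(c.chars->size()));
    c.null_map->push_back(0);
}

// Helper for inspecting the result: returns the stored bytes of one row.
std::string value_at(const ToyNullableStringColumn& col, size_t row) {
    assert(row < col.offsets.size());
    const uint32_t begin = (row == 0) ? 0 : col.offsets[row - 1];
    return std::string(col.chars.data() + begin, col.offsets[row] - begin);
}
```

A caller would invoke `prepare_batch` once per column before the row loop and `write_cell` for each field inside it, so the setup cost is paid per batch rather than per row.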
