vortual opened a new issue, #60172: URL: https://github.com/apache/doris/issues/60172
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

4.0.2

### What's Wrong?

# The following content is an AI-assisted diagnosis and analysis

## 1. Problem Overview

The Doris BE (Backend) process crashed with SIGSEGV at around 22:30 on 2026-01-22, interrupting a Mongo CDC sync job that was in progress. The BE process restarted automatically and recovered, with no data loss.

---

## 2. Crash Timeline

| Time | Event | Source |
|------|------|------|
| 22:30:11.973 | Previous batch of stream loads completed normally (txn 109883, 109893) | BE INFO log |
| 22:30:12.170 | **Two stream loads reached the FE in the same millisecond** | FE log |
| 22:30:12.173 | FE began the transaction for txn 109908 | FE log |
| 22:30:12.181 | BE started executing the stream load | BE INFO log |
| 22:30:12.189 | HTTP header handling finished | BE INFO log |
| 22:30:1x~50 | **💀 BE SIGSEGV crash** | be.out |
| 22:30:50.105 | BE restart completed | be.out |
| 22:31:10.347 | FE detected the BE restart and rolled back txn 109908 | FE log |
| 22:31:18.608 | Mongo CDC automatically retried the stream load | FE log |
| 22:31:34.672 | Retried transaction 109929 committed successfully | FE log |

---

## 3. Log Evidence

### 3.1 Crash Stack (be.out)

**File**: `/data/apache-doris-4.0.2-bin-x64/be/log/be.out`

```
*** Query id: 82486cd1d80c3022-81101df6583bb08e ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1769092212 (unix time) try "date -d @1769092212" if you are using GNU date ***
*** Current BE git commitID: 30d2df0459 ***
*** SIGSEGV address not mapped to object (@0x0) received by PID 2408066 (TID 2461123 OR 0x7f65349bc640) from PID 0; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:420
 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /opt/jdk-17.0.17+10/lib/server/libjvm.so
 2# JVM_handle_linux_signal in /opt/jdk-17.0.17+10/lib/server/libjvm.so
 3# 0x00007F795AA3FC30 in /lib64/libc.so.6
 4# 0x0000561F10943A42 in /data/apache-doris-4.0.2-bin-x64/be/lib/doris_be
 5# brpc::Controller::call_id() in /data/apache-doris-4.0.2-bin-x64/be/lib/doris_be
 6# doris::DummyBrpcCallback<doris::PTabletWriterAddBlockResult>::DummyBrpcCallback() at
/home/zcp/repo_center/doris_release/doris/be/src/util/brpc_closure.h:39
 7# doris::vectorized::VNodeChannel::init(doris::RuntimeState*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:575
 8# doris::vectorized::IndexChannel::init(doris::RuntimeState*, std::vector<doris::TTabletWithPartition, std::allocator<doris::TTabletWithPartition> > const&, bool) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:151
 9# doris::vectorized::VTabletWriter::_init(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:1630
10# doris::vectorized::VTabletWriter::open(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:1389
11# doris::vectorized::AsyncResultWriter::process_block(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/async_result_writer.cpp:119
12# std::_Function_handler<void (), doris::vectorized::AsyncResultWriter::start_writer(doris::RuntimeState*, doris::RuntimeProfile*)::$_0>::_M_invoke(std::_Any_data const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
13# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/util/threadpool.cpp:623
14# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:461
15# start_thread in /lib64/libc.so.6
16# __GI___clone3 in /lib64/libc.so.6
```

**Analysis**:
- Error type: `SIGSEGV address not mapped to object (@0x0)`, a null pointer dereference
- Query ID: `82486cd1d80c3022-81101df6583bb08e`
- Crash location: `VNodeChannel::init()` → `DummyBrpcCallback()` → `brpc::Controller::call_id()`
- Root cause: the bRPC Controller object is NULL

---

### 3.2 Query That Triggered the Crash (BE INFO log)

**File**: `/data/apache-doris-4.0.2-bin-x64/be/log/be.INFO.log.20260122-193049`

```
I20260122
22:30:12.172951 2409869 stream_load.cpp:218] new income streaming load request.id=82486cd1d80c3022-81101df6583bb08e, job_id=-1, txn_id=-1, label=lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb, elapse(s)=0, db=ods, tbl=table2, group_commit=0, HTTP headers=...
I20260122 22:30:12.173527 2409884 stream_load.cpp:218] new income streaming load request.id=ab4c7fe784d16dc7-f873f51fc204c8b1, job_id=-1, txn_id=-1, label=lb_mongo_doris_ods_table1_0_3866_8e69c489-c858-4d5b-98bd-035b3f8df879, elapse(s)=0, db=ods, tbl=table1, group_commit=0, HTTP headers=...
I20260122 22:30:12.181313 2409869 stream_load_executor.cpp:74] begin to execute stream load. label=lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb, txn_id=109908, query_id=82486cd1d80c3022-81101df6583bb08e
I20260122 22:30:12.189524 2409869 stream_load.cpp:225] finished to handle HTTP header, id=82486cd1d80c3022-81101df6583bb08e, job_id=-1, txn_id=109908, label=lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb, elapse(s)=0
```

**Analysis**:
- Triggering query ID: `82486cd1d80c3022-81101df6583bb08e`
- Target table: `table2`
- Transaction ID: 109908
- Source: Mongo CDC sync job
- Note: the two stream loads arrived within 0.6 ms of each other (22:30:12.172 and 22:30:12.173)

---

### 3.3 Concurrent Stream Loads Arriving in the Same Millisecond (FE log)

**File**: `/data/apache-doris-4.0.2-bin-x64/fe/log/fe.log.20260122-1`

```
2026-01-22 22:30:12,170 INFO (qtp149820420-710|710) [LoadAction.streamLoad():106] streamload action, db: ods, tbl: table1, headers: ...label:lb_mongo_doris_ods_table1_0_3866_8e69c489-c858-4d5b-98bd-035b3f8df879...
2026-01-22 22:30:12,170 INFO (qtp149820420-10209|10209) [LoadAction.streamLoad():106] streamload action, db: ods, tbl: table2, headers: ...label:lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb...
2026-01-22 22:30:12,173 INFO (thrift-server-pool-470|12411) [DatabaseTransactionMgr.beginTransaction():382] begin transaction: txn id 109908 with label lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb from coordinator BE: 10.66.7.1, listener id: -1
```

**Key findings**:
- **Both stream loads reached the FE in exactly the same millisecond (22:30:12,170)**
- The two requests were handled by threads 710 and 10209 respectively
- High concurrency triggered a race condition in bRPC initialization

---

### 3.6 BE Restart Record (be.out)

**File**: `/data/apache-doris-4.0.2-bin-x64/be/log/be.out`

```
INFO: java_cmd /opt/jdk-17.0.17+10/bin/java
INFO: jdk_version 17
StdoutLogger 2026-01-22 22:30:50,105 Start time: Thu Jan 22 10:30:50 PM CST 2026
INFO: java_cmd /opt/jdk-17.0.17+10/bin/java
INFO: jdk_version 17
OpenJDK 64-Bit Server VM warning: Option CriticalJNINatives was deprecated in version 16.0 and will likely be removed in a future release.
...
start BE in local mode
```

**Analysis**:
- The BE finished restarting at 22:30:50
- About 38 seconds elapsed between the crash and the restart

---

### 3.7 Concurrency Count (BE INFO log)

**Command**:

```bash
grep "table1" be.INFO.log.20260122-193049 | grep "22:30:1" | wc -l
```

**Result**: 9

**Analysis**:
- Within the 10-second window 22:30:1x, there were 9 stream load records related to `table1`
- This points to a high-concurrency scenario

---

### 3.8 Ruling Out OOM (system log)

**Command**:

```bash
dmesg | grep -i "killed process"
```

**Result**: empty

**Analysis**:
- There are no OOM kill records

---

## 4. Root Cause Analysis

### 4.1 Direct Cause

While creating a `DummyBrpcCallback` in `VNodeChannel::init()`, the code accessed an uninitialized `brpc::Controller` object, causing a null pointer dereference (SIGSEGV @0x0).

### 4.2 Trigger Condition

Two stream load requests reached the FE in **exactly the same millisecond** (22:30:12,170) and were forwarded to the same BE node, where they initialized `VNodeChannel` concurrently and triggered a bRPC-related race condition.

### 4.3 Call Stack Analysis

```
AsyncResultWriter::process_block()
  └→ VTabletWriter::open()
      └→ VTabletWriter::_init()
          └→ IndexChannel::init()
              └→ VNodeChannel::init()                  // initialize the node channel
                  └→ DummyBrpcCallback()               // create the bRPC callback
                      └→ brpc::Controller::call_id()   // 💥 null pointer
```

### 4.4 Preliminary Assessment

**Race condition**: when multiple stream loads initialize `VNodeChannel` at the same time, they contend for a shared bRPC Controller resource, and one of the threads ends up accessing an uninitialized object.

### What You Expected?

Normal operation.

### How to Reproduce?

_No response_

### Anything Else?

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
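
For anyone who wants to re-check the timings quoted in sections 2, 3.2 and 3.6, here is a minimal Python sketch that recomputes them from the raw values in the logs. It makes one assumption: the `CST` printed in be.out means China Standard Time (UTC+8). The variable names are illustrative only and not part of any Doris tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" in be.out is China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

# "Aborted at 1769092212 (unix time)" from the crash header in be.out.
abort_epoch = 1769092212
crash_time = datetime.fromtimestamp(abort_epoch, tz=CST)

# "StdoutLogger 2026-01-22 22:30:50,105" from the restart record in be.out.
restart_time = datetime(2026, 1, 22, 22, 30, 50, 105000, tzinfo=CST)

# glog timestamps of the two "new income streaming load request" lines.
glog_format = "%Y%m%d %H:%M:%S.%f"
first_request = datetime.strptime("20260122 22:30:12.172951", glog_format)
second_request = datetime.strptime("20260122 22:30:12.173527", glog_format)

downtime_s = (restart_time - crash_time).total_seconds()
gap_ms = (second_request - first_request).total_seconds() * 1000

print(f"crash time:        {crash_time.isoformat()}")
print(f"crash to restart:  {downtime_s:.1f} s")
print(f"gap between loads: {gap_ms:.3f} ms")
```

Under that timezone assumption, the abort timestamp falls at 22:30:12, the same second in which the BE began executing txn 109908. The computed figures match the roughly 38 seconds of downtime and the roughly 0.6 ms gap between the two requests stated in the report.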
