xy720 opened a new pull request, #48101:
URL: https://github.com/apache/doris/pull/48101
### What problem does this PR solve?

The BE crashes with the following core dump:

```
F20250219 16:11:04.432318 1252192 global.cpp:309] /data/home/lambxu/work/git/doris-tencent/doris-2.1/thirdparty/installed/include/google/protobuf/map.h:1293 CHECK failed: it != end(): key not found: 184236
*** Check failure stack trace: ***
    @ 0x5626cb1376e6 google::LogMessage::SendToLog()
    @ 0x5626cb134130 google::LogMessage::Flush()
    @ 0x5626cb137f29 google::LogMessageFatal::~LogMessageFatal()
    @ 0x5626ccc270ea (unknown)
    @ 0x5626cb86bfa5 google::protobuf::internal::LogMessage::Finish()
    @ 0x5626c15f6e1e google::protobuf::Map<>::at<>()
    @ 0x5626c15f3604 doris::TabletsChannel::_commit_txn()
    @ 0x5626c15f2f8b doris::TabletsChannel::close()
    @ 0x5626c14fb43a doris::LoadChannel::_handle_eos()
    @ 0x5626c14fb0f2 doris::LoadChannel::add_batch()
    @ 0x5626c14f5800 doris::LoadChannelMgr::add_batch()
    @ 0x5626c1667811 std::_Function_handler<>::_M_invoke()
    @ 0x5626c168199b doris::WorkThreadPool<>::work_thread()
    @ 0x5626cdf081a0 execute_native_thread_routine
    @ 0x7efdc1a25215 start_thread
    @ 0x7efdc1aa7bdc __clone3
    @ (nil) (unknown)
*** Query id: 9c94fa66404748-ab7987b4563a6318 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1739952664 (unix time) try "date -d @1739952664" if you are using GNU date ***
*** Current BE git commitID: f112af0fd2 ***
*** SIGABRT unknown detail explain (@0x3e8001317ab) received by PID 1251243 (TID 1252192 OR 0x7efbc38bd6c0) from PID 1251243; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/common/signal_handler.h:421
 1# 0x00007EFDC19D7AD0 in /lib64/libc.so.6
 2# __pthread_kill_implementation in /lib64/libc.so.6
 3# raise in /lib64/libc.so.6
 4# __GI_abort in /lib64/libc.so.6
 5# 0x00005626CB141FBD in /usr/local/service/doris/lib/be/doris_be
 6# 0x00005626CB1345FA in /usr/local/service/doris/lib/be/doris_be
 7# google::LogMessage::SendToLog() in /usr/local/service/doris/lib/be/doris_be
 8# google::LogMessage::Flush() in /usr/local/service/doris/lib/be/doris_be
 9# google::LogMessageFatal::~LogMessageFatal() in /usr/local/service/doris/lib/be/doris_be
10# 0x00005626CCC270EA in /usr/local/service/doris/lib/be/doris_be
11# google::protobuf::internal::LogMessage::Finish() in /usr/local/service/doris/lib/be/doris_be
12# doris::PSlaveTabletNodes const& google::protobuf::Map<long, doris::PSlaveTabletNodes>::at<long>(long const&) const in /usr/local/service/doris/lib/be/doris_be
13# doris::TabletsChannel::_commit_txn(doris::DeltaWriter*, doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) in /usr/local/service/doris/lib/be/doris_be
14# doris::TabletsChannel::close(doris::LoadChannel*, doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*, bool*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/tablets_channel.cpp:367
15# doris::LoadChannel::_handle_eos(doris::BaseTabletsChannel*, doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/load_channel.cpp:191
16# doris::LoadChannel::add_batch(doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/load_channel.cpp:172
17# doris::LoadChannelMgr::add_batch(doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/load_channel_mgr.cpp:156
18# std::_Function_handler<void (), doris::PInternalServiceImpl::tablet_writer_add_block(google::protobuf::RpcController*, doris::PTabletWriterAddBlockRequest const*, doris::PTabletWriterAddBlockResult*, google::protobuf::Closure*)::$_0>::_M_invoke(std::_Any_data const&) at /data/home/lambxu/installs/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
19# doris::WorkThreadPool<false>::work_thread(int) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/util/work_thread_pool.hpp:159
20# execute_native_thread_routine at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
21# start_thread in /lib64/libc.so.6
22# __GI___clone3 in /lib64/libc.so.6
```

This happens because the PTabletWriterAddBlockRequest sent from the coordinator BE may not carry the latest slave nodes. For example, consider this log:

```
I20250219 16:15:58.919366 1260197 vtablet_writer.cpp:988] VNodeChannel[151455-10002], load_id=fa58b8bcfada49e4-827999f3d6e97fe5, txn_id=40574, node=10.0.19.244:8060 mark closed, left pending batch size: 1
I20250219 16:15:58.921607 1260201 vrow_distribution.cpp:98] [DEBUG] VRowDistribution::automatic_create_partition, request: TCreatePartitionRequest(txn_id=40574, db_id=11403, table_id=151454, partitionValues=[[TNullableStringLiteral(value=xxxx, is_null=0)]])
I20250219 16:15:58.925675 1256625 tablets_channel.cpp:180] [DEBUG] BaseTabletsChannel::incremental_open TabletsChannelKey id { hi: -407372644475057692 lo: -9045029104035201051 }
W20250219 16:15:58.946277 1256557 tablets_channel.cpp:400] [DEBUG] TabletsChannel::_commit_txn TabletsChannelKey id { hi: -407372644475057692 lo: -9045029104035201051 }
F20250219 17:51:13.139384 14966 global.cpp:309] /data/home/lambxu/work/git/doris-master/doris/thirdparty/installed/include/google/protobuf/map.h:1293 CHECK failed: it != end(): key not found: 184236
```

I added some debug logs. They show that after the node channel is marked closed, the incremental open request arrives before the add block request. The new DeltaWriter created by the incremental open then cannot find its slave nodes in the add block request, so the lookup fails and the BE crashes.

Problem Summary:

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [x] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test.
      Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason? -->

- Behavior changed:
    - [x] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [x] No.
    - [ ] Yes. <!-- Add document PR link here. eg: https://github.com/apache/doris-website/pull/1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
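
The crash described under "What problem does this PR solve?" comes from calling `at()` on a map with a tablet id that the request does not contain. As an illustration only, and not necessarily the change made in this PR, the sketch below shows a guarded lookup that uses `find()` and reports a miss to the caller. `SlaveNodes`, `SlaveNodeMap` and `find_slave_nodes` are hypothetical names, and `std::map` stands in for `google::protobuf::Map` so that the example compiles without protobuf (both provide `find()` and `end()`).

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-tablet slave node list carried in the
// add block request. This is not the real doris::PSlaveTabletNodes message.
struct SlaveNodes {
    std::vector<std::string> hosts;
};

// Hypothetical stand-in for the request's tablet id -> slave nodes map.
// std::map is an assumption made to keep the sketch free of protobuf.
using SlaveNodeMap = std::map<int64_t, SlaveNodes>;

// Returns a pointer to the entry for tablet_id, or nullptr when the request
// does not carry one. Unlike at(), a missing key never aborts or throws here;
// the caller decides how to handle the miss.
const SlaveNodes* find_slave_nodes(const SlaveNodeMap& nodes, int64_t tablet_id) {
    auto it = nodes.find(tablet_id);
    return it == nodes.end() ? nullptr : &it->second;
}
```

A caller would check the returned pointer before using it, for example by skipping slave replication or returning an error for a tablet whose writer was created by an incremental open after the request was built.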