xy720 opened a new pull request, #48101:
URL: https://github.com/apache/doris/pull/48101

   ### What problem does this PR solve?
   
   The BE core dumps with the following stack trace:
   
   ```
   F20250219 16:11:04.432318 1252192 global.cpp:309] /data/home/lambxu/work/git/doris-tencent/doris-2.1/thirdparty/installed/include/google/protobuf/map.h:1293 CHECK failed: it != end(): key not found: 184236
   *** Check failure stack trace: ***
       @     0x5626cb1376e6  google::LogMessage::SendToLog()
       @     0x5626cb134130  google::LogMessage::Flush()
       @     0x5626cb137f29  google::LogMessageFatal::~LogMessageFatal()
       @     0x5626ccc270ea  (unknown)
       @     0x5626cb86bfa5  google::protobuf::internal::LogMessage::Finish()
       @     0x5626c15f6e1e  google::protobuf::Map<>::at<>()
       @     0x5626c15f3604  doris::TabletsChannel::_commit_txn()
       @     0x5626c15f2f8b  doris::TabletsChannel::close()
       @     0x5626c14fb43a  doris::LoadChannel::_handle_eos()
       @     0x5626c14fb0f2  doris::LoadChannel::add_batch()
       @     0x5626c14f5800  doris::LoadChannelMgr::add_batch()
       @     0x5626c1667811  std::_Function_handler<>::_M_invoke()
       @     0x5626c168199b  doris::WorkThreadPool<>::work_thread()
       @     0x5626cdf081a0  execute_native_thread_routine
       @     0x7efdc1a25215  start_thread
       @     0x7efdc1aa7bdc  __clone3
       @              (nil)  (unknown)
   *** Query id: 9c94fa66404748-ab7987b4563a6318 ***
   *** is nereids: 0 ***
   *** tablet id: 0 ***
   *** Aborted at 1739952664 (unix time) try "date -d @1739952664" if you are using GNU date ***
   *** Current BE git commitID: f112af0fd2 ***
   *** SIGABRT unknown detail explain (@0x3e8001317ab) received by PID 1251243 (TID 1252192 OR 0x7efbc38bd6c0) from PID 1251243; stack trace: ***
    0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/common/signal_handler.h:421
    1# 0x00007EFDC19D7AD0 in /lib64/libc.so.6
    2# __pthread_kill_implementation in /lib64/libc.so.6
    3# raise in /lib64/libc.so.6
    4# __GI_abort in /lib64/libc.so.6
    5# 0x00005626CB141FBD in /usr/local/service/doris/lib/be/doris_be
    6# 0x00005626CB1345FA in /usr/local/service/doris/lib/be/doris_be
    7# google::LogMessage::SendToLog() in /usr/local/service/doris/lib/be/doris_be
    8# google::LogMessage::Flush() in /usr/local/service/doris/lib/be/doris_be
    9# google::LogMessageFatal::~LogMessageFatal() in /usr/local/service/doris/lib/be/doris_be
   10# 0x00005626CCC270EA in /usr/local/service/doris/lib/be/doris_be
   11# google::protobuf::internal::LogMessage::Finish() in /usr/local/service/doris/lib/be/doris_be
   12# doris::PSlaveTabletNodes const& google::protobuf::Map<long, doris::PSlaveTabletNodes>::at<long>(long const&) const in /usr/local/service/doris/lib/be/doris_be
   13# doris::TabletsChannel::_commit_txn(doris::DeltaWriter*, doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) in /usr/local/service/doris/lib/be/doris_be
   14# doris::TabletsChannel::close(doris::LoadChannel*, doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*, bool*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/tablets_channel.cpp:367
   15# doris::LoadChannel::_handle_eos(doris::BaseTabletsChannel*, doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/load_channel.cpp:191
   16# doris::LoadChannel::add_batch(doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/load_channel.cpp:172
   17# doris::LoadChannelMgr::add_batch(doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/load_channel_mgr.cpp:156
   18# std::_Function_handler<void (), doris::PInternalServiceImpl::tablet_writer_add_block(google::protobuf::RpcController*, doris::PTabletWriterAddBlockRequest const*, doris::PTabletWriterAddBlockResult*, google::protobuf::Closure*)::$_0>::_M_invoke(std::_Any_data const&) at /data/home/lambxu/installs/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
   19# doris::WorkThreadPool<false>::work_thread(int) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/util/work_thread_pool.hpp:159
   20# execute_native_thread_routine at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
   21# start_thread in /lib64/libc.so.6
   22# __GI___clone3 in /lib64/libc.so.6
   ```
   
   This happens because the PTabletWriterAddBlockRequest sent from the coordinator BE may not carry the latest slave nodes.
   For example, see the following log:
   
   ```
   I20250219 16:15:58.919366 1260197 vtablet_writer.cpp:988] VNodeChannel[151455-10002], load_id=fa58b8bcfada49e4-827999f3d6e97fe5, txn_id=40574, node=10.0.19.244:8060 mark closed, left pending batch size: 1
   I20250219 16:15:58.921607 1260201 vrow_distribution.cpp:98] [DEBUG] VRowDistribution::automatic_create_partition, request: TCreatePartitionRequest(txn_id=40574, db_id=11403, table_id=151454, partitionValues=[[TNullableStringLiteral(value=xxxx, is_null=0)]])
   I20250219 16:15:58.925675 1256625 tablets_channel.cpp:180] [DEBUG] BaseTabletsChannel::incremental_open
   TabletsChannelKey id {
     hi: -407372644475057692
     lo: -9045029104035201051
   }
   W20250219 16:15:58.946277 1256557 tablets_channel.cpp:400] [DEBUG] TabletsChannel::_commit_txn
   TabletsChannelKey id {
     hi: -407372644475057692
     lo: -9045029104035201051
   }
   F20250219 17:51:13.139384 14966 global.cpp:309] /data/home/lambxu/work/git/doris-master/doris/thirdparty/installed/include/google/protobuf/map.h:1293 CHECK failed: it != end(): key not found: 184236
   ```
   
   I added some debug logging. It shows that after the node channel is marked closed, the incremental open request arrives before the add block request.
   
   The new DeltaWriter created by the incremental open then cannot find its slave nodes in the request, and the BE crashes.
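
The failure mode can be sketched in isolation. This is only an illustration of a defensive lookup, not the patch in this PR: `SlaveTabletNodes`, `SlaveNodeMap`, and `slave_hosts_or_empty` are hypothetical stand-ins, and `std::unordered_map` replaces `google::protobuf::Map` so the snippet needs no protobuf dependency. The stack trace shows the abort coming from `Map<>::at<>()`; a `find()`-based lookup lets the caller handle a tablet that the request does not know about.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for doris::PSlaveTabletNodes.
struct SlaveTabletNodes {
    std::vector<std::string> hosts;
};

// Hypothetical stand-in for the tablet_id -> slave nodes map carried by the request.
using SlaveNodeMap = std::unordered_map<int64_t, SlaveTabletNodes>;

// Returns the slave hosts for tablet_id, or an empty list when the request
// carries no entry for it (for example, a tablet whose writer was created by
// an incremental open after the request was built).
// Using find() instead of at() turns a missing key into a normal, checkable result.
std::vector<std::string> slave_hosts_or_empty(const SlaveNodeMap& slave_nodes,
                                              int64_t tablet_id) {
    auto it = slave_nodes.find(tablet_id);
    if (it == slave_nodes.end()) {
        return {};
    }
    return it->second.hosts;
}
```

With a lookup of this shape, a writer whose tablet id is absent from the request simply gets no slave nodes. Whether that is the right behavior for the load (as opposed to failing it with an error status) is a design decision this sketch does not settle.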
   
   Problem Summary:
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [x] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [x] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [x] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
