morningman opened a new issue #3871: URL: https://github.com/apache/incubator-doris/issues/3871
**Describe the bug**

BE crashes with a coredump like:

```
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x0000000001045667 in release (bytes=8192, this=0x633b7d40) at /var/doris/palo-core/be/src/runtime/mem_tracker.h:221
#2 doris::RowBatch::clear (this=0x8fc6bce0) at /var/doris/palo-core/be/src/runtime/row_batch.cpp:294
#3 0x00000000010459dd in clear (this=0x8fc6bce0) at /var/doris/palo-core/be/src/common/object_pool.h:54
#4 doris::RowBatch::~RowBatch (this=0x8fc6bce0, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/row_batch.cpp:301
#5 0x0000000001045a01 in doris::RowBatch::~RowBatch (this=0x8fc6bce0, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/row_batch.cpp:302
#6 0x00000000014a0299 in operator() (this=<optimized out>, __ptr=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/unique_ptr.h:78
#7 ~unique_ptr (this=0x7070ec98, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/unique_ptr.h:268
#8 doris::stream_load::NodeChannel::~NodeChannel (this=0x7070ec00, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/exec/tablet_sink.cpp:41
#9 0x00000000014a53f2 in ~SpecificElement (this=0x77645500, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/common/object_pool.h:75
#10 doris::ObjectPool::SpecificElement<doris::stream_load::NodeChannel>::~SpecificElement (this=0x77645500, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/common/object_pool.h:76
#11 0x0000000001052efd in clear (this=0x6a213220) at /var/doris/palo-core/be/src/common/object_pool.h:54
#12 ~ObjectPool (this=0x6a213220, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/common/object_pool.h:37
#13 std::_Sp_counted_ptr<doris::ObjectPool*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:376
#14 0x000000000104b591 in _M_release (this=0x6a212fa0) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:154
#15 ~__shared_count (this=0x891de310, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:684
#16 ~__shared_ptr (this=0x891de308, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:1123
#17 ~shared_ptr (this=0x891de308, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr.h:93
#18 doris::RuntimeState::~RuntimeState (this=0x891de300, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/runtime_state.cpp:128
#19 0x000000000103dcc6 in checked_delete<doris::RuntimeState> (x=0x891de300) at /var/doris/thirdparty/installed/include/boost/core/checked_delete.hpp:34
#20 ~scoped_ptr (this=0x88082ce0, __in_chrg=<optimized out>) at /var/doris/thirdparty/installed/include/boost/smart_ptr/scoped_ptr.hpp:89
#21 doris::PlanFragmentExecutor::~PlanFragmentExecutor (this=0x88082b70, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/plan_fragment_executor.cpp:62
#22 0x0000000000fd39ed in doris::FragmentExecState::~FragmentExecState (this=0x88082b00, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/fragment_mgr.cpp:175
```

**Debug**

In some abnormal situations, performing an Insert load causes BE to crash.
Troubleshooting turned up the following:

* **FE side**

  https://github.com/apache/incubator-doris/blob/2211cb0ee0fcd23d4fd2445494aba6cf1a020987/fe/src/main/java/org/apache/doris/qe/Coordinator.java#L475-L489

  During `execState.execRemoteFragmentAsync()`, if an RPC error occurs (for example, the corresponding BE is down), an exception is thrown directly instead of the error being returned through the `Future<PExecPlanFragmentResult>`. In that case the Coordinator never performs the subsequent `Cancel` operation. After the exception is thrown, the stack unwinds back to `handleInsertStmt()`, which simply returns the error message to the user: the Insert has failed, and FE does no further processing.

* **BE side**

  BE receives the Insert execution plan and calls `_sink->open` in `PlanFragmentExecutor::open_internal()` to open the `TabletSink`:

  https://github.com/apache/incubator-doris/blob/2211cb0ee0fcd23d4fd2445494aba6cf1a020987/be/src/runtime/plan_fragment_executor.cpp#L272-L292

  The open method of `TabletSink` opens all related NodeChannels via RPC, but because some BEs have problems, some NodeChannels fail to open. An error message appears in the BE log:

  ```
  tablet open failed, load_id=4e2f4d6076cc4692-b8f26cc924b6be0d, txn_id144942945, node=10.xx.xx.xx:8060, errmsg=failed to open tablet writer
  ```

  However, because the majority of NodeChannels opened successfully, the `TabletSink` open operation returns success. `PlanFragmentExecutor::open_internal()` then calls `get_next_internal()` to start fetching data; because the Insert has already failed at this point, this call returns failure. **And there is a bug in the subsequent `update_status(status)` method**:

  https://github.com/apache/incubator-doris/blob/2211cb0ee0fcd23d4fd2445494aba6cf1a020987/be/src/runtime/plan_fragment_executor.cpp#L485-L503

  On line 493, `if (_status.ok())` should be `if (!_status.ok())`. Because of this error the `_status` variable is never updated, so when the `TabletSink` is finally closed, the NodeChannels are closed instead of canceled. A normal NodeChannel close sends the last RowBatch the channel holds and destroys the RowBatch object. But the NodeChannels that never opened cannot be closed normally, and **the RowBatches held by these NodeChannels are never destroyed**.

After the whole plan finishes executing, the PlanFragmentExecutor's destruction begins. The destructor call chain is as follows:

```
PlanFragmentExecutor
 |--- RuntimeState
       |--- RuntimeProfile        (destructed first)
       |--- ObjectPool
             |--- NodeChannel
                   |--- RowBatch
                         |--- MemTracker->release()
                               |--- profile->_consumption->add(-bytes)
```

Note that within RuntimeState the RuntimeProfile is destructed first, so destroying the leftover RowBatch inside a NodeChannel calls `add()` on the already-destructed `_consumption` object, which finally crashes BE.

**The whole process has the following problems** (minimal sketches of the crash pattern and of each fix follow below):

1. The bug in `update_status(status)` prevents the NodeChannels from being canceled correctly, which causes the problem during final destruction. (If NodeChannel's Cancel were called earlier, the RowBatch would be destructed in advance.)
2. When the FE Coordinator executes `execRemoteFragmentAsync()` and encounters an RPC error, it should return a Future carrying an error code, continue the normal flow, and actively call Cancel().
3. `_status` in RuntimeState has no lock protection, which may cause potential problems.
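The destruction-order hazard itself is easy to reproduce outside Doris. C++ destroys data members in reverse declaration order, so an object destroyed later must not dereference a sibling destroyed earlier. A minimal, self-contained sketch of the same pattern (all type names here are illustrative stand-ins, not Doris code):

```cpp
#include <memory>
#include <vector>

// Stand-in for the RuntimeProfile counter that MemTracker::release()
// updates (the profile->_consumption->add(-bytes) frame in the stack).
struct Counter {
    long bytes = 0;
    void add(long delta) { bytes += delta; }
};

// Stand-in for RowBatch: its destructor reports freed memory back
// into the profile counter, like MemTracker::release() does.
struct Batch {
    explicit Batch(Counter* c) : consumption(c) {}
    ~Batch() { consumption->add(-8192); } // dangles if the counter died first
    Counter* consumption;                 // non-owning pointer into the "profile"
};

// Stand-in for RuntimeState. Members are destroyed in reverse
// declaration order, so `profile` below dies BEFORE `pool` --
// the same order the issue describes (RuntimeProfile before
// ObjectPool inside RuntimeState).
struct State {
    std::vector<std::unique_ptr<Batch>> pool; // destroyed second
    std::unique_ptr<Counter> profile;         // destroyed first
};

int main() {
    State s;
    s.profile = std::make_unique<Counter>();
    s.pool.push_back(std::make_unique<Batch>(s.profile.get()));
    // When `s` goes out of scope, `profile` dies first; ~Batch then
    // dereferences the dead Counter: undefined behavior, and in the
    // Doris case an actual segfault. Clearing `pool` before `profile`
    // is destroyed (which an early Cancel achieves) avoids it.
    return 0;
}
```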
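For problem 1, here is a sketch of the corrected condition, assuming the early-return structure the issue describes; `Status` below is a minimal stand-in for `doris::Status`, and the member names `_status` / `_status_lock` are borrowed from `plan_fragment_executor.cpp`, with everything else trimmed:

```cpp
#include <mutex>
#include <string>

// Minimal stand-in for doris::Status.
struct Status {
    bool ok() const { return msg.empty(); }
    std::string msg; // empty means OK
};

// Sketch of the corrected error-recording logic of
// PlanFragmentExecutor::update_status(): the first error wins.
class ExecutorSketch {
public:
    void update_status(const Status& new_status) {
        if (new_status.ok()) {
            return; // nothing to record
        }
        std::lock_guard<std::mutex> l(_status_lock);
        // The buggy version tested `if (_status.ok())`, bailing out
        // exactly when no error had been recorded yet, so _status was
        // never set. The corrected test keeps the FIRST error and
        // records the new one only when none is present.
        if (!_status.ok()) {
            return; // an earlier error is already recorded; keep it
        }
        _status = new_status;
    }

private:
    std::mutex _status_lock;
    Status _status; // stays OK until the first error arrives
};
```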
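For problem 2, the actual fix belongs in the Java Coordinator, but the pattern is language-independent: trap the RPC exception at the call site and surface it through the same Future that the success path uses, so the caller always reaches its Cancel logic. Sketched here in C++ for consistency with the other examples (all names are illustrative, not Doris RPC types):

```cpp
#include <future>
#include <iostream>
#include <stdexcept>
#include <string>

// Illustrative result type: status_code 0 means OK, anything else
// carries the RPC error for the caller to inspect.
struct ExecResult {
    int status_code;
    std::string error_msg;
};

// Stands in for the async RPC stub; throws when the target BE is down.
std::future<ExecResult> send_plan_fragment() {
    throw std::runtime_error("backend unreachable");
}

// Instead of letting the exception escape (which skips the caller's
// Cancel path), fold it into a ready Future carrying the error, so
// success and failure flow through the same code path.
std::future<ExecResult> exec_remote_fragment_async() {
    try {
        return send_plan_fragment();
    } catch (const std::exception& e) {
        std::promise<ExecResult> p;
        p.set_value(ExecResult{-1, e.what()});
        return p.get_future();
    }
}

int main() {
    ExecResult r = exec_remote_fragment_async().get();
    if (r.status_code != 0) {
        // The caller can now proceed to cancel the other fragments.
        std::cout << "rpc failed: " << r.error_msg << "\n";
    }
    return 0;
}
```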
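For problem 3, the usual remedy is to route every read and write of `_status` through one mutex. A sketch of that pattern, not the actual RuntimeState code:

```cpp
#include <mutex>
#include <string>

// Minimal stand-in for doris::Status, as in the sketch above.
struct Status {
    bool ok() const { return msg.empty(); }
    std::string msg; // empty means OK
};

// Lock-protected status access for a RuntimeState-like class;
// the real RuntimeState has many more members.
class StateSketch {
public:
    // Record only the first error; later errors are dropped.
    void set_status(const Status& s) {
        std::lock_guard<std::mutex> l(_status_lock);
        if (_status.ok() && !s.ok()) {
            _status = s;
        }
    }

    // Copy out under the lock so readers never race with a writer.
    Status status() const {
        std::lock_guard<std::mutex> l(_status_lock);
        return _status;
    }

private:
    mutable std::mutex _status_lock;
    Status _status;
};
```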