morningman opened a new issue #3871:
URL: https://github.com/apache/incubator-doris/issues/3871


   **Describe the bug**
   BE crashes with a coredump like the following:
   
   ```
   (gdb) bt
   #0  0x0000000000000000 in ?? ()
   #1  0x0000000001045667 in release (bytes=8192, this=0x633b7d40) at 
/var/doris/palo-core/be/src/runtime/mem_tracker.h:221
   #2  doris::RowBatch::clear (this=0x8fc6bce0) at 
/var/doris/palo-core/be/src/runtime/row_batch.cpp:294
   #3  0x00000000010459dd in clear (this=0x8fc6bce0) at 
/var/doris/palo-core/be/src/common/object_pool.h:54
   #4  doris::RowBatch::~RowBatch (this=0x8fc6bce0, __in_chrg=<optimized out>) 
at /var/doris/palo-core/be/src/runtime/row_batch.cpp:301
   #5  0x0000000001045a01 in doris::RowBatch::~RowBatch (this=0x8fc6bce0, 
__in_chrg=<optimized out>) at 
/var/doris/palo-core/be/src/runtime/row_batch.cpp:302
   #6  0x00000000014a0299 in operator() (this=<optimized out>, __ptr=<optimized 
out>) at 
/var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/unique_ptr.h:78
   #7  ~unique_ptr (this=0x7070ec98, __in_chrg=<optimized out>) at 
/var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/unique_ptr.h:268
   #8  doris::stream_load::NodeChannel::~NodeChannel (this=0x7070ec00, 
__in_chrg=<optimized out>) at 
/var/doris/palo-core/be/src/exec/tablet_sink.cpp:41
   #9  0x00000000014a53f2 in ~SpecificElement (this=0x77645500, 
__in_chrg=<optimized out>) at 
/var/doris/palo-core/be/src/common/object_pool.h:75
   #10 
doris::ObjectPool::SpecificElement<doris::stream_load::NodeChannel>::~SpecificElement
 (this=0x77645500, __in_chrg=<optimized out>) at 
/var/doris/palo-core/be/src/common/object_pool.h:76
   #11 0x0000000001052efd in clear (this=0x6a213220) at 
/var/doris/palo-core/be/src/common/object_pool.h:54
   #12 ~ObjectPool (this=0x6a213220, __in_chrg=<optimized out>) at 
/var/doris/palo-core/be/src/common/object_pool.h:37
   #13 std::_Sp_counted_ptr<doris::ObjectPool*, 
(__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at 
/var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:376
   #14 0x000000000104b591 in _M_release (this=0x6a212fa0) at 
/var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:154
   #15 ~__shared_count (this=0x891de310, __in_chrg=<optimized out>) at 
/var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:684
   #16 ~__shared_ptr (this=0x891de308, __in_chrg=<optimized out>) at 
/var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:1123
   #17 ~shared_ptr (this=0x891de308, __in_chrg=<optimized out>) at 
/var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr.h:93
   #18 doris::RuntimeState::~RuntimeState (this=0x891de300, 
__in_chrg=<optimized out>) at 
/var/doris/palo-core/be/src/runtime/runtime_state.cpp:128
   #19 0x000000000103dcc6 in checked_delete<doris::RuntimeState> (x=0x891de300) 
at /var/doris/thirdparty/installed/include/boost/core/checked_delete.hpp:34
   #20 ~scoped_ptr (this=0x88082ce0, __in_chrg=<optimized out>) at 
/var/doris/thirdparty/installed/include/boost/smart_ptr/scoped_ptr.hpp:89
   #21 doris::PlanFragmentExecutor::~PlanFragmentExecutor (this=0x88082b70, 
__in_chrg=<optimized out>) at 
/var/doris/palo-core/be/src/runtime/plan_fragment_executor.cpp:62
   #22 0x0000000000fd39ed in doris::FragmentExecState::~FragmentExecState 
(this=0x88082b00, __in_chrg=<optimized out>) at 
/var/doris/palo-core/be/src/runtime/fragment_mgr.cpp:175
   ```
   
   **Debug**
   
   In some abnormal situations, performing an Insert load can cause the BE to crash. The root cause analysis is as follows:
   
   * FE side:
   
   
https://github.com/apache/incubator-doris/blob/2211cb0ee0fcd23d4fd2445494aba6cf1a020987/fe/src/main/java/org/apache/doris/qe/Coordinator.java#L475-L489
   
   During the execution of `execState.execRemoteFragmentAsync()`, if an RPC error occurs (for example, the corresponding BE is down), an exception is thrown directly instead of the error being returned via `Future<PExecPlanFragmentResult>`. In this case, the Coordinator does not proceed with the subsequent `Cancel` operation.
   
   After the exception is thrown, the call stack returns to `handleInsertStmt()`, which directly returns the error message to the user. The Insert fails, and the FE does no further processing.
   
   * BE side
   
   The BE receives the Insert execution plan and calls `_sink->open` in `PlanFragmentExecutor::open_internal()` to open the `TabletSink`.
   
   
https://github.com/apache/incubator-doris/blob/2211cb0ee0fcd23d4fd2445494aba6cf1a020987/be/src/runtime/plan_fragment_executor.cpp#L272-L292
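
   To make the later steps easier to follow, here is a rough sketch of that flow with simplified, stand-in types (illustrative only, not the actual Doris code): the sink is opened, data is pulled batch by batch, and whatever status is left at the end is handed to the sink's close, which decides between a normal close and a cancel.

   ```cpp
   #include <string>
   #include <utility>
   #include <vector>

   // Minimal stand-ins for doris::Status, DataSink and RowBatch.
   struct Status {
       bool ok_;
       std::string msg;
       Status() : ok_(true) {}
       Status(bool ok, std::string m) : ok_(ok), msg(std::move(m)) {}
       bool ok() const { return ok_; }
   };
   struct RowBatch {};
   struct DataSink {
       virtual ~DataSink() = default;
       virtual Status open() = 0;                            // opens the NodeChannels via RPC
       virtual Status send(RowBatch* batch) = 0;
       virtual Status close(const Status& exec_status) = 0;  // OK -> close, error -> cancel
   };

   // Rough skeleton of the open_internal() flow described above.
   Status open_internal_sketch(DataSink* sink, const std::vector<RowBatch*>& batches) {
       Status status = sink->open();
       for (RowBatch* batch : batches) {   // stands in for the get_next_internal() loop
           if (!status.ok()) break;
           status = sink->send(batch);
       }
       // In the real code the error is first recorded into _status via update_status();
       // whatever status reaches close() decides whether the NodeChannels are
       // closed normally or cancelled.
       sink->close(status);
       return status;
   }
   ```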
   
   The Open method of `TabletSink` opens all related NodeChannels via RPC, but because some BEs are in an abnormal state, some of the NodeChannels fail to open. An error message appears in the BE log:
   
   ```
   tablet open failed, load_id=4e2f4d6076cc4692-b8f26cc924b6be0d, 
txn_id144942945, node=10.xx.xx.xx:8060, errmsg=failed to open tablet writer
   ```
   
   However, because the majority of NodeChannels were opened successfully, the `TabletSink` Open operation still returned success.
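
   A simplified sketch of this tolerance behavior (the threshold logic is only illustrative; the real check lives in the TabletSink open path):

   ```cpp
   #include <cstdio>
   #include <vector>

   // Illustrative only: the sink's open tolerates a minority of failed
   // NodeChannels, so the overall open still reports success.
   bool open_all_channels(const std::vector<bool>& channel_open_ok) {
       int failed = 0;
       for (bool ok : channel_open_ok) {
           if (!ok) ++failed;  // the real code logs "failed to open tablet writer" here
       }
       // Succeed as long as a majority of the replicas opened.
       return failed * 2 < static_cast<int>(channel_open_ok.size());
   }

   int main() {
       // One of three replicas is down: open still "succeeds", but the failed
       // NodeChannel is left in a never-opened state.
       std::printf("open ok: %d\n", open_all_channels({true, true, false}));
   }
   ```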
   
   Next, `PlanFragmentExecutor::open_internal()` calls `get_next_internal()` to start fetching data. Because the Insert has already failed at this point, this call returns a failure. **And there is a bug in the subsequent `update_status(status)` method**:
   
   
https://github.com/apache/incubator-doris/blob/2211cb0ee0fcd23d4fd2445494aba6cf1a020987/be/src/runtime/plan_fragment_executor.cpp#L485-L503
   
   Line 493: `if (!_status.ok())` should be `if (_status.ok())`. Because of this error, the `_status` variable is never updated with the failure status. As a result, the NodeChannels are closed instead of being canceled when the TabletSink is finally closed.
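
   In other words, the guard has to record the *first* error rather than drop it. A minimal sketch of the intended behavior with a simplified `Status` (illustrative, not the actual Doris code):

   ```cpp
   #include <string>
   #include <utility>

   struct Status {
       bool ok_;
       std::string msg;
       Status() : ok_(true) {}
       Status(bool ok, std::string m) : ok_(ok), msg(std::move(m)) {}
       bool ok() const { return ok_; }
   };

   struct ExecutorSketch {
       Status _status;  // starts out OK

       // Record the first non-OK status; OK statuses and later errors must not
       // overwrite what is already recorded.
       Status update_status(const Status& status) {
           if (status.ok()) {
               return _status;
           }
           if (_status.ok()) {      // the buggy version tested `!_status.ok()` here,
               _status = status;    // so the first error was never recorded
           }
           return _status;
       }
   };

   int main() {
       ExecutorSketch executor;
       executor.update_status(Status(false, "get_next failed"));
       // executor._status now carries the error, so the sink can be cancelled
       // instead of closed.
       return executor._status.ok() ? 1 : 0;
   }
   ```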
   
   A normal NodeChannel close operation sends the last RowBatch it holds and then destroys the RowBatch object. Because some NodeChannels were never opened successfully, they are not closed normally, so **the RowBatch held by these NodeChannels is not destroyed** at this point.
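
   A toy sketch of that difference (illustrative member names, not the actual Doris code): close sends and frees the pending RowBatch, cancel frees it immediately, and a channel that is neither closed nor cancelled keeps it until the channel itself is destructed much later:

   ```cpp
   #include <cstdio>
   #include <memory>

   struct RowBatch {};

   // Toy NodeChannel; in the real code close() sends the pending batch over RPC.
   struct NodeChannel {
       std::unique_ptr<RowBatch> _pending_batch = std::make_unique<RowBatch>();

       void close() {
           std::puts("send the last batch, then free it");
           _pending_batch.reset();  // RowBatch destroyed now
       }
       void cancel() {
           std::puts("drop the pending batch without sending it");
           _pending_batch.reset();  // RowBatch destroyed now as well
       }
       // If neither close() nor cancel() runs -- the bug scenario -- the pending
       // RowBatch only dies together with the NodeChannel, deep inside the
       // ObjectPool destruction shown below.
   };

   int main() {
       NodeChannel never_opened;  // neither closed nor cancelled
   }                              // the pending RowBatch is destroyed only here
   ```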
   
   After the execution of the entire plan completes, the PlanFragmentExecutor enters its destruction process. The destructor call chain is as follows:
   
   ```
   PlanFragmentExecutor
   |--- RuntimeState
        |--- RuntimeProfile
        |--- ObjectPool
             |--- NodeChannel
                  |--- RowBatch
                       |--- MemTracker->release()
                            |--- profile->_consumption->add(-bytes)
   ```
   
   Note that within RuntimeState the RuntimeProfile is destructed first. So when the RowBatch held by the NodeChannel is destructed later, it calls into the already-destructed `_consumption` counter, which eventually crashes the BE.
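
   This use-after-destruction can be modeled with a small toy program (illustrative names mirroring the chain above, not the Doris code). Here the ordering comes from the C++ rule that members are destructed in reverse declaration order; the profile, and with it the `_consumption` counter, is already gone by the time the pooled RowBatch is destructed:

   ```cpp
   #include <cstdio>
   #include <memory>
   #include <vector>

   struct RuntimeProfile {
       ~RuntimeProfile() { std::puts("~RuntimeProfile (the _consumption counter dies here)"); }
   };
   struct RowBatch {
       ~RowBatch() {
           // In the real code MemTracker::release() runs here and ends up in
           // profile->_consumption->add(-bytes) -- a pointer that is already dangling.
           std::puts("~RowBatch       (would touch the dead counter here -> crash)");
       }
   };
   struct NodeChannel {
       std::unique_ptr<RowBatch> _pending_batch = std::make_unique<RowBatch>();
       ~NodeChannel() { std::puts("~NodeChannel"); }
   };
   struct ObjectPool {
       std::vector<std::unique_ptr<NodeChannel>> _objects;
       ~ObjectPool() { std::puts("~ObjectPool"); }
   };
   struct RuntimeState {
       std::shared_ptr<ObjectPool> _obj_pool = std::make_shared<ObjectPool>();
       RuntimeProfile _profile;  // declared after _obj_pool, so destructed before it
   };

   int main() {
       RuntimeState state;
       state._obj_pool->_objects.push_back(std::make_unique<NodeChannel>());
   }  // prints: ~RuntimeProfile, ~ObjectPool, ~NodeChannel, ~RowBatch
   ```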
   
   
   **The whole process has the following problems**:
   
   1. The bug in `update_status(status)` caused the NodeChannel not to be canceled correctly, which led to problems during the final destruction. (If the NodeChannel's Cancel were called earlier, its RowBatch would be destructed earlier.)
   2. When the FE Coordinator executes `execRemoteFragmentAsync()` and encounters an RPC error, it should return a Future carrying the error code, continue the subsequent flow, and actively call `Cancel()`.
   3. The `_status` in RuntimeState has no lock protection, which may cause potential problems; a minimal sketch follows this list.
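
   For the third point, a minimal sketch of what lock protection around the shared status could look like (illustrative, not the actual fix):

   ```cpp
   #include <mutex>
   #include <string>
   #include <utility>

   struct Status {
       bool ok_;
       std::string msg;
       Status() : ok_(true) {}
       Status(bool ok, std::string m) : ok_(ok), msg(std::move(m)) {}
       bool ok() const { return ok_; }
   };

   // Readers and writers of the shared status go through the same mutex, so an
   // error set by a sender/RPC thread is reliably seen by the execution thread.
   class StatusHolder {
   public:
       void set_error(const Status& status) {
           std::lock_guard<std::mutex> l(_lock);
           if (_status.ok() && !status.ok()) {
               _status = status;  // first error wins
           }
       }
       Status get() {
           std::lock_guard<std::mutex> l(_lock);
           return _status;
       }

   private:
       std::mutex _lock;
       Status _status;
   };

   int main() {
       StatusHolder holder;
       holder.set_error(Status(false, "rpc failed"));
       return holder.get().ok() ? 1 : 0;
   }
   ```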
   

