zhiqiang-hhhh opened a new pull request, #40495:
URL: https://github.com/apache/doris/pull/40495

   A deadlock occurs when submitting a scanner to the scheduler fails.
   
   The pstack output looks like this:
   ```txt
   Thread 2012 (Thread 0x7f87363fb700 (LWP 4179707) "Pipe_normal [wo"):
   #0  0x00007f8b8f3dc82d in __lll_lock_wait () from /lib64/libpthread.so.0
   #1  0x00007f8b8f3d5ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
   #2  0x000055b20f333e7a in __gthread_mutex_lock (__mutex=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
   #3  std::mutex::lock (this=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100
   #4  std::lock_guard<std::mutex>::lock_guard (__m=..., this=<optimized out>) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:229
   #5  doris::vectorized::ScannerContext::append_block_to_queue (this=<optimized out>, scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:234
   #6  0x000055b20f32c0f9 in doris::vectorized::ScannerScheduler::submit (this=<optimized out>, ctx=..., scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_scheduler.cpp:209
   #7  0x000055b20f3338fc in doris::vectorized::ScannerContext::submit_scan_task (this=this@entry=0x7f8733d96010, scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:217
   #8  0x000055b20f3346cd in doris::vectorized::ScannerContext::get_block_from_queue (this=0x7f8733d96010, state=<optimized out>, block=0x7f871f728de0, eos=0x7f871abce470, id=<optimized out>) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:290
   #9  0x000055b214cb4f13 in doris::pipeline::ScanOperatorX<doris::pipeline::OlapScanLocalState>::get_block (this=<optimized out>, state=0x7f872f0eb400, block=0x7f8b8f3dc82d <__lll_lock_wait+29>, eos=0x7f871abce470) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/scan_operator.cpp:1292
   #10 0x000055b2142b5772 in doris::pipeline::ScanOperatorX<doris::pipeline::OlapScanLocalState>::get_block_after_projects (this=0x80, state=0x0, block=0x7f8b8f3dc82d <__lll_lock_wait+29>, eos=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/scan_operator.h:363
   #11 0x000055b2142e7880 in doris::pipeline::StatefulOperatorX<doris::pipeline::StreamingAggLocalState>::get_block (this=0x7f871f9bee00, state=0x7f872f0eb400, block=0x7f8716d49060, eos=0x7f87363f4937) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/operator.cpp:587
   ```
   The deadlock happens on the following call path:
   ```cpp
   Status ScannerContext::get_block_from_queue {
        std::unique_lock l(_transfer_lock);
        ...
        if (scan_task->is_eos()) {
        ...
        } else {
             // resubmit current running scanner to read the next block
            submit_scan_task(scan_task);
        }
   }
   
   ScannerContext::submit_scan_task(std::shared_ptr<ScanTask> scan_task) {
        _scanner_scheduler->submit(shared_from_this(), scan_task);
   }
   
   void ScannerScheduler::submit(std::shared_ptr<ScannerContext> ctx,
                                 std::shared_ptr<ScanTask> scan_task) {
       ...
       if (auto ret = sumbit_task(); !ret) {
           scan_task->set_status(Status::InternalError(
                   "Failed to submit scanner to scanner pool reason:" + std::string(ret.msg()) +
                   "|type:" + std::to_string(type)));
           ctx->append_block_to_queue(scan_task);
           return;
       }
   }
   
   void ScannerContext::append_block_to_queue(std::shared_ptr<ScanTask> scan_task) {
       ...
       std::lock_guard<std::mutex> l(_transfer_lock);
       ...
   }
   ```
   Because `std::mutex` is not reentrant, the thread deadlocks with itself when it tries to acquire `_transfer_lock` a second time.
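
The re-entry can be modelled in isolation. The sketch below is not Doris code: `CheckedMutex`, `ToyContext`, and `reproduces_self_deadlock` are hypothetical names, and the wrapper throws on a nested acquisition by the owning thread so that the problem becomes observable instead of hanging. With a plain `std::mutex`, the same nested acquisition would block forever, as in the trace.

```cpp
#include <atomic>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical debugging wrapper (not part of Doris): throws instead of
// blocking when the thread that already owns the lock tries to lock again.
class CheckedMutex {
public:
    void lock() {
        if (_owner.load() == std::this_thread::get_id()) {
            throw std::logic_error("nested lock: thread would deadlock with itself");
        }
        _mutex.lock();
        _owner.store(std::this_thread::get_id());
    }

    void unlock() {
        _owner.store(std::thread::id());
        _mutex.unlock();
    }

private:
    std::mutex _mutex;
    std::atomic<std::thread::id> _owner{std::thread::id()};
};

// Toy stand-in for the call chain get_block_from_queue -> submit ->
// append_block_to_queue; the scheduler hop is collapsed into a direct call.
struct ToyContext {
    CheckedMutex transfer_lock;

    void append_block_to_queue() {
        std::lock_guard<CheckedMutex> l(transfer_lock);  // second acquisition
    }

    void get_block_from_queue() {
        std::lock_guard<CheckedMutex> l(transfer_lock);  // first acquisition
        append_block_to_queue();  // models the failed-submit path
    }
};

// Returns true when the nested acquisition is detected.
bool reproduces_self_deadlock() {
    ToyContext ctx;
    try {
        ctx.get_block_from_queue();
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

Calling `reproduces_self_deadlock()` takes the same path as frames #8 through #5 of the trace, with the checked wrapper standing in for `_transfer_lock`.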
   
   This PR fixes the problem by making `ScannerScheduler::submit` return a `Status` instead of appending the failed task to the `ScannerContext`. The caller then decides whether to resubmit the scanner or abort the query.
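
A minimal sketch of that shape, under stated assumptions: `ToyStatus`, `ToyScheduler`, `ToyContext`, `pool_full`, and `process_status` are illustrative stand-ins rather than the real Doris types or fields, and what the real caller does with a failed status may differ from what is shown here.

```cpp
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for doris::Status, for illustration only.
struct ToyStatus {
    bool ok_flag = true;
    std::string msg;

    bool ok() const { return ok_flag; }
    static ToyStatus OK() { return ToyStatus{}; }
    static ToyStatus InternalError(std::string m) {
        return ToyStatus{false, std::move(m)};
    }
};

// The scheduler only reports failure; it no longer calls back into the context.
struct ToyScheduler {
    bool pool_full = false;  // assumption: simulates a rejected submission

    ToyStatus submit() {
        if (pool_full) {
            return ToyStatus::InternalError("Failed to submit scanner to scanner pool");
        }
        return ToyStatus::OK();
    }
};

struct ToyContext {
    std::mutex transfer_lock;
    ToyScheduler scheduler;
    ToyStatus process_status;  // hypothetical field that records the failure

    ToyStatus get_block_from_queue() {
        std::unique_lock<std::mutex> l(transfer_lock);
        ToyStatus st = scheduler.submit();
        if (!st.ok()) {
            // The lock is already held here, so the failure is recorded
            // directly; no second acquisition of transfer_lock is needed.
            process_status = st;
        }
        return st;  // the caller decides: resubmit or abort the query
    }
};

// Helper for the usage note: runs one call with or without a full pool.
bool submit_succeeds(bool pool_full) {
    ToyContext ctx;
    ctx.scheduler.pool_full = pool_full;
    return ctx.get_block_from_queue().ok();
}
```

Here `submit_succeeds(true)` returns `false` without blocking, because the error travels back up the call stack as a return value instead of through a second lock on `transfer_lock`.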
   

