liutang123 opened a new issue #7545:
URL: https://github.com/apache/incubator-doris/issues/7545


   When an instance's execution ends, it reports to the FE if the execution failed.
   
   When `thrift_connect_timeout_seconds` is set to a large value, lock contention and the long creation time of the FE thrift client in `client_cache` may prevent instances from exiting.
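
   As a minimal, hypothetical sketch of the pattern the logs and the stack trace below suggest (simplified names; this is not Doris's actual `ClientCacheHelper` code):

   ```
   // Sketch of the problematic locking pattern: a single mutex guards the
   // whole cache and stays held across a blocking connect that can take up
   // to thrift_connect_timeout_seconds.
   #include <map>
   #include <mutex>
   #include <string>
   #include <vector>

   struct ThriftClient { /* connection state */ };

   class ClientCache {
   public:
       ThriftClient* get_client(const std::string& hostport) {
           std::lock_guard<std::mutex> l(_lock);  // every caller serializes here
           auto& free_list = _free_clients[hostport];
           if (!free_list.empty()) {
               ThriftClient* client = free_list.back();
               free_list.pop_back();
               return client;
           }
           // The blocking connect runs while _lock is still held, so one slow
           // connection stalls every other thread that needs any client,
           // including instance threads that only want to report and exit.
           return create_client(hostport);
       }

   private:
       ThriftClient* create_client(const std::string& /*hostport*/) {
           // The real code opens a thrift socket here; the connect can block
           // for the full connect timeout.
           return new ThriftClient();
       }

       std::mutex _lock;
       std::map<std::string, std::vector<ThriftClient*>> _free_clients;
   };
   ```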
   
   I added some logs to detect this case:
   ```
   I1224 11:32:26.279561 132304 plan_fragment_executor.cpp:241] Open(): fragment_instance_id=ec8e96fe3cf940fe-ab3746d86e4e8ce7
   W1224 11:32:40.074090 132304 data_stream_sender.cpp:151] failed to send brpc batch, error=Host is down, error_text=[E104]Fail to read from Socket{id=257 fd=1814 addr=10.xxxxxxxxx0:8060:13524} (0x0x1b2094200): Connection reset by peer [R1][E112]Not connected to 10.xxxxxxxxx:8060 yet, server_id=257 [R2][E112]Not connected to 10.xxxxxxxx:8060 yet, server_id=257 [R3][E112]Not connected to 10.xxxxxxxx:8060 yet, server_id=257, client: 10.xxxxxxx
   W1224 11:32:40.125466 132304 data_stream_sender.cpp:151] failed to send brpc batch, error=Host is down, error_text=[E104]Fail to read from Socket{id=257 fd=1814 addr=10.xxxxxxx:8060:13524} (0x0x1b2094200): Connection reset by peer [R1][E112]Not connected to 10.xxxxxx:8060 yet, server_id=257 [R2][E112]Not connected to 10.xxxxxxx:8060 yet, server_id=257 [R3][E112]Not connected to 10.xxxxx:8060 yet, server_id=257, client: 10.xxxxxxx
   ....
   W1224 11:32:40.564545 132304 data_stream_sender.cpp:151] failed to send brpc batch, error=Host is down, error_text=[E104]Fail to read from Socket{id=257 fd=1814 addr=10.xxxxxx:8060:13524} (0x0x1b2094200): Connection reset by peer [R1][E112]Not connected to 10.xxxxxx:8060 yet, server_id=257 [R2][E112]Not connected to 10.xxxxxx:8060 yet, server_id=257 [R3][E112]Not connected to 10.xxxxx:8060 yet, server_id=257, client: 10.xxxxxxxx
   W1224 11:33:40.471328 132304 client_cache.h:134] get_client get lock cost 59906 ms
   W1224 11:34:40.532156 132304 client_cache.h:134] get_client create_client cost 60060 ms
   ```
   
   ```
   W1224 10:22:45.245872 69034 client_cache.h:134] get_client create_client cost 60,060 ms
   W1224 10:25:34.010275 67723 client_cache.h:134] get_client create_client cost 168,039 ms
   W1224 10:27:57.512188 74348 client_cache.h:134] get_client create_client cost 143,366 ms
   W1224 10:29:53.459988 67588 client_cache.h:134] get_client create_client cost 113,272 ms
   W1224 10:31:39.886162 68310 client_cache.h:134] get_client create_client cost 105,604 ms
   W1224 10:32:41.749461 68543 client_cache.h:134] get_client create_client cost 60,015 ms
   W1224 10:33:43.985551 69362 client_cache.h:134] get_client create_client cost 60,060 ms
   W1224 10:34:45.581478 73557 client_cache.h:134] get_client create_client cost 60,001 ms
   W1224 10:35:48.613260 75902 client_cache.h:134] get_client create_client cost 60,013 ms
   W1224 10:36:48.968451 76243 client_cache.h:134] get_client create_client cost 60,008 ms
   W1224 10:37:49.186419 76255 client_cache.h:134] get_client create_client cost 60,060 ms
   W1224 10:38:55.576458 84068 client_cache.h:134] get_client create_client cost 60,025 ms
   W1224 10:39:57.009445 109549 client_cache.h:134] get_client create_client cost 60,014 ms
   W1224 11:33:40.434983 132399 client_cache.h:134] get_client create_client cost 60,060 ms
   W1224 11:34:40.532156 132304 client_cache.h:134] get_client create_client cost 60,060 ms
   W1224 11:57:45.403156 154359 client_cache.h:134] get_client create_client cost 107,756 ms
   W1224 12:00:00.658280 149799 client_cache.h:134] get_client create_client cost 134,611 ms
   W1224 12:01:49.407203 150261 client_cache.h:134] get_client create_client cost 108,342 ms
   ```
   
   ```
   I1224 11:53:03.466200 109783 plan_fragment_executor.cpp:77] Prepare(): query_id=4503d9a4f5304a4f-99fcdf4bbcd2e9d7 fragment_instance_id=4503d9a4f5304a4f-99fcdf4bbcd2f4d7 backend_num=3072 thread id by pthread_self()=139683259016960
   I1224 11:53:03.468600 152520 fragment_mgr.cpp:482] _exec_actual(): query_id=4503d9a4f5304a4f-99fcdf4bbcd2e9d7 fragment_instance_id=4503d9a4f5304a4f-99fcdf4bbcd2f4d7 thread id by pthread_self()=139672295827200
   I1224 11:53:03.468643 152520 plan_fragment_executor.cpp:241] Open(): fragment_instance_id=4503d9a4f5304a4f-99fcdf4bbcd2f4d7
   I1224 11:55:57.685446 109774 internal_service.cpp:169] cancel fragment, fragment_instance_id=4503d9a4f5304a4f-99fcdf4bbcd2f4d7, reason: 3
   I1224 11:55:57.685490 109774 plan_fragment_executor.cpp:525] cancel(): fragment_instance_id=4503d9a4f5304a4f-99fcdf4bbcd2f4d7 FragmentExecState::cancel
   I1224 11:55:57.697355 109497 plan_fragment_executor.cpp:525] cancel(): fragment_instance_id=4503d9a4f5304a4f-99fcdf4bbcd2f4d7 FragmentExecState::cancel
   W1224 12:01:50.092286 152520 client_cache.h:134] get_client get lock cost 351,517 ms
   W1224 12:01:50.094197 152520 fragment_mgr.cpp:162] coordinator_callback create FrontendServiceConnection cost 351,519 ms
   I1224 12:01:50.100473 152520 plan_fragment_executor.cpp:525] cancel(): fragment_instance_id=4503d9a4f5304a4f-99fcdf4bbcd2f4d7 coordinator_callback 2
   W1224 12:03:41.310636 152520 client_cache.h:134] release_client get lock cost 111,210 ms
   W1224 12:03:41.311704 152520 fragment_mgr.cpp:257] Got error while opening fragment 4503d9a4f5304a4f-99fcdf4bbcd2f4d7: Cancelled: Cancelled
   I1224 12:03:41.324925 152520 plan_fragment_executor.cpp:590] Fragment 4503d9a4f5304a4f-99fcdf4bbcd2f4d7:(Ac
   ```
   
   The stack of the blocked thread (LWP 154359) is:
   ```
   Thread 936 (Thread 0x7f0582448700 (LWP 154359)):
   #0  0x00007f0b44d3dc3d in poll () from /lib64/libc.so.6
   #1  0x0000000001b3e97b in apache::thrift::transport::TSocket::openConnection (this=this@entry=0x4bfd0bdc0, res=res@entry=0xbf5b755c0) at src/thrift/transport/TSocket.cpp:345
   #2  0x0000000001b3f513 in apache::thrift::transport::TSocket::local_open (this=0x4bfd0bdc0) at src/thrift/transport/TSocket.cpp:452
   #3  0x000000000115f643 in doris::ThriftClientImpl::open (this=this@entry=0xf079cc4d0) at palo/be/src/util/thrift_client.cpp:33
   #4  0x000000000107d777 in doris::ClientCacheHelper::create_client(doris::TNetworkAddress const&, boost::function<doris::ThriftClientImpl* (doris::TNetworkAddress const&, void**)>, void**, int) (this=this@entry=0x10c3c1860, hostport=..., factory_method=..., client_key=client_key@entry=0x7f0582443e38, timeout_ms=timeout_ms@entry=0) at palo/be/src/runtime/client_cache.cpp:139
   #5  0x000000000107efb2 in doris::ClientCacheHelper::get_client(doris::TNetworkAddress const&, boost::function<doris::ThriftClientImpl* (doris::TNetworkAddress const&, void**)>, void**, int) (this=this@entry=0x10c3c1860, hostport=..., factory_method=..., client_key=client_key@entry=0x7f0582443e38, timeout_ms=timeout_ms@entry=0) at palo/be/src/runtime/client_cache.cpp:76
   #6  0x000000000100b1f7 in get_client (timeout_ms=0, iface=0x7f0582443e38, hostport=..., this=0x10c3c1860) at palo/be/src/runtime/client_cache.h:265
   #7  ClientConnection (status=<synthetic pointer>, address=<error reading variable: access outside bounds of object referenced via synthetic pointer>, client_cache=0x10c3c1860, this=0x7f0582443e30) at palo/be/src/runtime/client_cache.h:164
   #8  doris::FragmentExecState::coordinator_callback (this=0x10b8c52000, status=..., profile=0x65e29f848, done=<optimized out>) at palo/be/src/runtime/fragment_mgr.cpp:310
   #9  0x0000000001085c6f in operator() (a2=<optimized out>, a1=<optimized out>, a0=..., this=<optimized out>) at palo/thirdparty/installed/include/boost/function/function_template.hpp:759
   #10 doris::PlanFragmentExecutor::send_report (this=this@entry=0x10b8c52070, done=done@entry=true) at palo/be/src/runtime/plan_fragment_executor.cpp:439
   #11 0x0000000001085d93 in doris::PlanFragmentExecutor::update_status (this=this@entry=0x10b8c52070, new_status=...) at palo/be/src/runtime/plan_fragment_executor.cpp:517
   #12 0x000000000108783a in doris::PlanFragmentExecutor::open (this=this@entry=0x10b8c52070) at palo/be/src/runtime/plan_fragment_executor.cpp:265
   #13 0x000000000100970e in doris::FragmentExecState::execute (this=0x10b8c52000) at palo/be/src/runtime/fragment_mgr.cpp:256
   ```
   
   When a big query with a lot of instances fails, the number of instance threads becomes very large.
   
   
![image](https://user-images.githubusercontent.com/17537020/147742524-72c1689c-f8df-44b5-bacd-69b7c2542cda.png)
   
![image](https://user-images.githubusercontent.com/17537020/147742551-f5f7d9ce-c206-47ae-ac3f-d51d203b8bec.png)
   
   We can optimize the locking logic in `client_cache` so that the creation of a single FE thrift client does not block other instance threads from exiting.
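
   One possible direction, sketched against the hypothetical `ClientCache` above rather than the real code: release the lock before the blocking connect, so the slow path no longer serializes every caller.

   ```
   // Possible shape of the fix: hold _lock only for cache bookkeeping and
   // release it before the blocking connect.
   ThriftClient* ClientCache::get_client(const std::string& hostport) {
       {
           std::lock_guard<std::mutex> l(_lock);
           auto& free_list = _free_clients[hostport];
           if (!free_list.empty()) {
               ThriftClient* client = free_list.back();
               free_list.pop_back();
               return client;
           }
       }  // _lock is released here, before the slow path
       // The connect may still take up to thrift_connect_timeout_seconds,
       // but it no longer blocks threads that need a cached client or that
       // only want to report a failed instance and exit.
       return create_client(hostport);
   }
   ```

   In this sketch, a race where two threads miss the cache for the same host and both create a client is harmless: the extra client is simply returned to the free list on release.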
   
   
   

