We often receive errors due to socket timeouts on send/recv operations:

slurm_load_jobs error: Socket timed out on send/recv operation
slurm_load_node: Socket timed out on send/recv operation

What could cause these errors? How likely is it that job_submit.lua causes them? We have a program that runs every 2 seconds to collect information on pending jobs. Could that program cause the errors?

Our Slurm version is 17.11.

Some extra debug information from slurmctl.log:

[2019-04-15T04:34:47.094] debug: Note large processing time from job_submit_plugin_submit: usec=1317325 began=04:34:45.777
[2019-04-15T04:34:47.098] debug: Note large processing time from _slurm_rpc_complete_prolog: usec=1240300 began=04:34:45.858
[2019-04-15T04:34:49.744] debug: Note large processing time from job_submit_plugin_submit: usec=1301871 began=04:34:48.442
[2019-04-15T04:34:56.541] debug: Note large processing time from job_submit_plugin_submit: usec=1258167 began=04:34:55.283
[2019-04-15T04:34:58.620] debug: Note large processing time from job_submit_plugin_submit: usec=1295753 began=04:34:57.324
[2019-04-15T04:34:58.823] debug: Note large processing time from _slurmctld_background: usec=1229287 began=04:34:57.581
[2019-04-15T04:35:00.013] debug: Note large processing time from job_submit_plugin_submit: usec=1252367 began=04:34:58.761
[2019-04-15T04:35:01.435] debug: Note large processing time from job_submit_plugin_submit: usec=1278561 began=04:35:00.156
[2019-04-15T04:35:02.843] debug: Note large processing time from job_submit_plugin_submit: usec=1263240 began=04:35:01.579
[2019-04-15T04:35:03.111] debug: Note large processing time from dump_all_job_state: usec=1108738 began=04:35:02.002
[2019-04-15T04:35:04.100] debug: Note large processing time from job_submit_plugin_submit: usec=1254256 began=04:35:02.846
[2019-04-15T04:35:05.335] debug: Note large processing time from job_submit_plugin_submit: usec=2488678 began=04:35:02.846

Output from sdiag:

*******************************************************
sdiag output at Wed Apr 17 09:15:35 2019 (1555510535)
Data since Tue Apr 16 19:00:00 2019 (1555459200)
*******************************************************
Server thread count: 3
Agent queue size: 0
DBD Agent queue size: 0

Jobs submitted: 4907
Jobs started: 4900
Jobs completed: 4910
Jobs canceled: 28
Jobs failed: 0

Jobs running: 377
Jobs running ts: Wed Apr 17 09:15:13 2019 (1555510513)

Main schedule statistics (microseconds):
    Last cycle: 2177
    Max cycle: 289836
    Total cycles: 9167
    Mean cycle: 5660
    Mean depth cycle: 32
    Cycles per minute: 10
    Last queue length: 27

Backfilling stats
    Total backfilled jobs (since last slurm start): 133491
    Total backfilled jobs (since last stats cycle start): 984
    Total backfilled heterogeneous job components: 0
    Total cycles: 1691
    Last cycle when: Wed Apr 17 09:15:12 2019 (1555510512)
    Last cycle: 51703
    Max cycle: 699037
    Mean cycle: 85826
    Last depth cycle: 27
    Last depth cycle (try sched): 27
    Depth Mean: 33
    Depth Mean (try depth): 33
    Last queue length: 27
    Queue length mean: 31

Remote Procedure Call statistics by message type
    REQUEST_JOB_INFO ( 2003) count:1826319 ave_time:25114 total_time:45866876387
    REQUEST_PARTITION_INFO ( 2009) count:1290235 ave_time:401 total_time:518371152
    REQUEST_FED_INFO ( 2049) count:1052360 ave_time:401 total_time:422773504
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:868778 ave_time:37473 total_time:32555814249
    REQUEST_JOB_USER_INFO ( 2039) count:704905 ave_time:22712 total_time:16010454603
    REQUEST_JOB_INFO_SINGLE ( 2021) count:473505 ave_time:68461 total_time:32417060555
    REQUEST_COMPLETE_PROLOG ( 6018) count:406364 ave_time:438771 total_time:178301089558
    MESSAGE_EPILOG_COMPLETE ( 6012) count:405918 ave_time:237988 total_time:96603820959
    REQUEST_STEP_COMPLETE ( 5016) count:403717 ave_time:215119 total_time:86847579110
    REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:366802 ave_time:368023 total_time:134991874450
    REQUEST_SUBMIT_BATCH_JOB ( 4003) count:318304 ave_time:2022617 total_time:643807267441
    REQUEST_NODE_INFO ( 2007) count:67568 ave_time:38776 total_time:2620081867
    REQUEST_PING ( 1008) count:55100 ave_time:293 total_time:16159038
    REQUEST_JOB_STEP_CREATE ( 5001) count:37243 ave_time:18476 total_time:688122372
    REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:36398 ave_time:36577 total_time:1331341719
    REQUEST_KILL_JOB ( 5032) count:7821 ave_time:50206 total_time:392666200
    REQUEST_CANCEL_JOB_STEP ( 5005) count:2342 ave_time:12425 total_time:29100941
    REQUEST_BUILD_INFO ( 2001) count:574 ave_time:28934 total_time:16608216
    REQUEST_JOB_NOTIFY ( 4022) count:547 ave_time:17239 total_time:9429961
    ACCOUNTING_UPDATE_MSG (10001) count:457 ave_time:7636182 total_time:3489735519
    REQUEST_NODE_INFO_SINGLE ( 2040) count:75 ave_time:281 total_time:21132
    REQUEST_UPDATE_JOB ( 3001) count:21 ave_time:98102 total_time:2060154
    REQUEST_UPDATE_PARTITION ( 3005) count:11 ave_time:748 total_time:8231
    REQUEST_RESOURCE_ALLOCATION ( 4001) count:11 ave_time:470757 total_time:5178336
    REQUEST_BATCH_SCRIPT ( 2051) count:10 ave_time:73539 total_time:735394
    REQUEST_JOB_READY ( 4019) count:9 ave_time:305 total_time:2753
    REQUEST_RESERVATION_INFO ( 2024) count:9 ave_time:324 total_time:2918
    REQUEST_UPDATE_NODE ( 3002) count:8 ave_time:90478 total_time:723827
    REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:4 ave_time:1225 total_time:4901
    REQUEST_SHARE_INFO ( 2022) count:4 ave_time:45136 total_time:180547
    REQUEST_CREATE_RESERVATION ( 3006) count:3 ave_time:3740 total_time:11220
    REQUEST_UPDATE_RESERVATION ( 3009) count:3 ave_time:4174 total_time:12523
    REQUEST_DELETE_RESERVATION ( 3008) count:3 ave_time:507 total_time:1523
    REQUEST_JOB_WILL_RUN ( 4012) count:2 ave_time:1337100 total_time:2674200
    REQUEST_JOB_STEP_INFO ( 2005) count:1 ave_time:374 total_time:374
    REQUEST_STATS_INFO ( 2035) count:1 ave_time:280 total_time:280
    REQUEST_TRIGGER_PULL ( 2030) count:1 ave_time:625 total_time:625

Remote Procedure Call statistics by user
    root ( 0) count:2453662 ave_time:215740 total_time:529354938803
    covenant07 ( 15246) count:2059328 ave_time:112083 total_time:230816339235
    slurm ( 280) count:1976634 ave_time:25383 total_time:50174519509
    zshang ( 17246) count:182449 ave_time:6807 total_time:1241946262
    shaowen.mao ( 19650) count:162364 ave_time:18302 total_time:2971624851
    cpattison ( 17344) count:101999 ave_time:19912 total_time:2031065227
    parisamir ( 19240) count:95721 ave_time:292960 total_time:28042501476
    francis ( 3112) count:74287 ave_time:19762 total_time:1468067338
    djimenez ( 17823) count:37358 ave_time:266408 total_time:9952490492
    yangyxjtu ( 18539) count:28396 ave_time:8983 total_time:255107082
    rbovio ( 18281) count:22860 ave_time:8506720 total_time:194463621227
    jmq811 ( 16240) count:21898 ave_time:70610 total_time:1546234039
    stewart1983 ( 15971) count:15691 ave_time:45062 total_time:707071202
    anjanatalapatra ( 14172) count:10720 ave_time:65625 total_time:703504099

Thanks.

Yang Liu, Ph.D.
Associate Research Scientist
High Performance Research Computing
Division of Research
Texas A&M University