Hi,

We have an issue: when we run srun (with --pty zsh) and then reboot the node (from a different shell), srun reports:

srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]: Zero Bytes were transmitted or received

and then hangs.
After the node boots, Slurm claims that the job is still RUNNING, and srun is still alive (but unresponsive). I've tried various configurations (select/linear, select/cons_tres, jobacct_gather/linux, jobacct_gather/cgroup, task/none, task/cgroup), all with the same result. We're using 19.05.1. Running with sbatch puts the job in the more appropriate NODE_FAIL state instead.

Has anyone else encountered this, or does anyone know how to get the job out of the RUNNING state once it's clearly no longer running?

Thanks in advance,
Yair.