Hi, Yair,

Out of curiosity, have you checked to see if this is a runaway job?

David

On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom <ir...@cs.huji.ac.il> wrote:

> Hi,
>
> We have an issue where running srun (with --pty zsh), and rebooting the
> node (from a different shell), the srun reports:
> srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]:
> Zero Bytes were transmitted or received
> and hangs.
>
> After the node boots, the slurm claims that job is still RUNNING, and srun
> is still alive (but not responsive).
>
> I've tried it with various configurations (select/linear,
> select/cons_tres, jobacct_gather/linux, jobacct_gather/cgroup, task/none,
> task/cgroup), with the same results. We're using 19.05.1.
> Running with sbatch causes the job to be in the more appropriate NODE_FAIL
> state instead.
>
> Anyone else encountered this? or know how to make the job state not
> RUNNING after it's clearly not running?
>
> Thanks in advance,
> Yair.

--
David Rhey
---------------
Advanced Research Computing - Technology Services
University of Michigan