I've checked it now; it isn't listed as a runaway job.

On Tue, Mar 31, 2020 at 5:24 PM David Rhey <dr...@umich.edu> wrote:
> Hi, Yair,
>
> Out of curiosity have you checked to see if this is a runaway job?
>
> David
>
> On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom <ir...@cs.huji.ac.il> wrote:
>
>> Hi,
>>
>> We have an issue where running srun (with --pty zsh), and rebooting the
>> node (from a different shell), the srun reports:
>> srun: error: eio_message_socket_accept:
>> slurm_receive_msg[an.ip.addr.ess]: Zero Bytes were transmitted or received
>> and hangs.
>>
>> After the node boots, the slurm claims that job is still RUNNING, and
>> srun is still alive (but not responsive).
>>
>> I've tried it with various configurations (select/linear,
>> select/cons_tres, jobacct_gather/linux, jobacct_gather/cgroup, task/none,
>> task/cgroup), with the same results. We're using 19.05.1.
>> Running with sbatch causes the job to be in the more appropriate
>> NODE_FAIL state instead.
>>
>> Anyone else encountered this? or know how to make the job state not
>> RUNNING after it's clearly not running?
>>
>> Thanks in advance,
>> Yair.
>>
>
> --
> David Rhey
> ---------------
> Advanced Research Computing - Technology Services
> University of Michigan