Re: [slurm-users] What is the 2^32-1 value in "stepd_connect to .4294967295 failed" telling you

2019-03-08 Thread Christopher Samuel
On 3/8/19 12:25 AM, Kevin Buckley wrote:
error: stepd_connect to .1 failed: No such file or directory
error: stepd_connect to .4294967295 failed: No such file or directory
We can imagine why a job that got killed in step 0 might still be looking for the .1 step but the .2^32-1 is beyond our i

Re: [slurm-users] How to list available CPUs/GPUs for jobs

2019-03-08 Thread Ole Holm Nielsen
On 3/8/19 1:59 PM, Frava wrote: I'm replying to the "[slurm-users] Available gpus ?" post. Some time ago I wrote a BASHv4 script for listing the available CPU/RAM/GPU on the nodes. It parses the output of the "scontrol -o -d show node" command and displays what I think is needed to launch GPU

[slurm-users] How to list available CPUs/GPUs for jobs

2019-03-08 Thread Frava
Hi all, I'm replying to the "[slurm-users] Available gpus ?" post. Some time ago I wrote a BASHv4 script for listing the available CPU/RAM/GPU on the nodes. It parses the output of the "scontrol -o -d show node" command and displays what I think is needed to launch GPU jobs. For now the script do
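As a rough illustration of the approach Frava describes (not the actual script), here is a minimal bash sketch that parses "scontrol -o -d show node" and subtracts allocated from total resources. The field names used (CPUAlloc, CPUTot, RealMemory, AllocMem, Gres, GresUsed) and the GRES string format ("gpu:4", "gpu:tesla:4(IDX:0-3)", ...) vary between Slurm versions and gres.conf setups, so treat every name below as an assumption to check against your own scontrol output.

    #!/usr/bin/env bash
    # Sketch: list free CPUs, memory and GPUs per node by parsing the
    # one-line-per-node output of "scontrol -o -d show node" (the -d is
    # what makes GresUsed appear, at least on reasonably recent Slurm).
    gpu_re='gpu[^,(]*:([0-9]+)'     # first GPU count in a Gres/GresUsed string

    printf '%-15s %9s %12s %9s\n' NODE CPU_FREE MEM_FREE_MB GPU_FREE

    scontrol -o -d show node | while read -r line; do
        unset f; declare -A f
        # Split the record into KEY=VALUE pairs; values containing spaces
        # (e.g. Reason=...) get mangled, but none of the fields used here do.
        for kv in $line; do
            [[ $kv == *=* ]] && f[${kv%%=*}]=${kv#*=}
        done

        gpu_tot=0; gpu_used=0
        [[ ${f[Gres]:-}     =~ $gpu_re ]] && gpu_tot=${BASH_REMATCH[1]}
        [[ ${f[GresUsed]:-} =~ $gpu_re ]] && gpu_used=${BASH_REMATCH[1]}

        printf '%-15s %9d %12d %9d\n' \
            "${f[NodeName]:-?}" \
            $(( ${f[CPUTot]:-0}     - ${f[CPUAlloc]:-0} )) \
            $(( ${f[RealMemory]:-0} - ${f[AllocMem]:-0} )) \
            $(( gpu_tot - gpu_used ))
    done

It needs nothing more than a login node and an ordinary user account; nodes without GPUs simply report 0 in the last column.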

Re: [slurm-users] Source of SIGTERM

2019-03-08 Thread Marcus Wagner
Hi Doug, you could try using auditd to catch the source. Back when we used LSF, we had an issue with one of our prolog scripts, which killed jobs when a job of the same user was already on the node. auditd helped at that point to identify our own nodecleaner script ;) Best Marc
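The auditd trick Marcus refers to is, in outline, auditing the kill(2) syscall filtered on the signal number. The commands below are a sketch of that idea rather than his exact rules; the key name sigterm_hunt is just a label picked here, 32-bit nodes would need a matching arch=b32 rule, and signals sent via tkill()/tgkill() carry the signal number in a different argument, so they need rules of their own.

    # Log every kill(2) call that delivers SIGTERM (a1 = signal number = 15);
    # run as root on the node where the job gets killed.
    auditctl -a always,exit -F arch=b64 -S kill -F a1=15 -k sigterm_hunt

    # Reproduce the problem, then see which executable and PID sent the signal:
    ausearch -k sigterm_hunt -i

    # Remove the rule again once you have your answer:
    auditctl -d always,exit -F arch=b64 -S kill -F a1=15 -k sigterm_hunt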

[slurm-users] What is the 2^32-1 value in "stepd_connect to .4294967295 failed" telling you

2019-03-08 Thread Kevin Buckley
We have some SLURM jobs for which the .0 process is being killed by the OS's oom-kill, which makes SLURM come over all wobbly! After messages akin to:
[.extern] _oom_event_monitor: oom-kill event count: 1
[.0] done with job
what we then see in the slurmd logs are messages of the form: error
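On the number itself: 4294967295 is simply the all-ones 32-bit value, 2^32 - 1. Slurm keeps step IDs at the top of the uint32 range for its special steps (extern container, batch script, and so on), so the ".4294967295" is a sentinel step ID rather than a real numbered step; exactly which sentinel depends on the SLURM_EXTERN_CONT / SLURM_BATCH_SCRIPT defines in the slurm.h of the installed version, so that reading should be checked rather than taken on trust.

    # 2^32 - 1 is the largest unsigned 32-bit integer:
    printf '%u = 0x%X\n' $(( (1 << 32) - 1 )) $(( (1 << 32) - 1 ))
    # -> 4294967295 = 0xFFFFFFFF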