On 3/8/19 12:25 AM, Kevin Buckley wrote:
error: stepd_connect to .1 failed: No such file or directory
error: stepd_connect to .4294967295 failed: No such file or directory
We can imagine why a job that got killed in step 0 might still be looking
for the .1 step, but the .4294967295 (2^32-1) one is beyond our imagination.
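For what it's worth, 4294967295 is just the all-ones unsigned 32-bit value,
i.e. 2^32-1 or (uint32_t)-1, which Slurm seems to use as a special "no value"
marker rather than a real step number; a quick shell check:

  echo $(( 2**32 - 1 ))    # 4294967295
  echo $(( 0xFFFFFFFF ))   # 4294967295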
On 3/8/19 1:59 PM, Frava wrote:
Hi all,
I'm replying to the "[slurm-users] Available gpus ?" post. Some time ago I
wrote a BASHv4 script for listing the available CPU/RAM/GPU on the nodes.
It parses the output of the "scontrol -o -d show node" command and displays
what I think is needed to launch GPU jobs.
For now the script do
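As a rough illustration of that approach (this is not Frava's actual script;
the CPUAlloc/CPUTot/RealMemory/AllocMem/Gres/GresUsed field names are assumed
from typical "scontrol -o -d show node" output), a BASHv4 sketch could look
like:

  #!/usr/bin/env bash
  # Sketch only: print free CPU/RAM and the GRES/GresUsed fields per node
  # by parsing "scontrol -o -d show node" (one node per output line).
  scontrol -o -d show node | while read -r line; do
      declare -A f=()
      for kv in $line; do                 # split the line into KEY=VALUE tokens
          f[${kv%%=*}]=${kv#*=}
      done
      printf '%-12s CPU %4s/%-4s  Mem %6s/%-6s MB  Gres=%s GresUsed=%s\n' \
          "${f[NodeName]}" \
          "$(( f[CPUTot] - f[CPUAlloc] ))" "${f[CPUTot]}" \
          "$(( f[RealMemory] - f[AllocMem] ))" "${f[RealMemory]}" \
          "${f[Gres]:-none}" "${f[GresUsed]:-n/a}"
  done

The associative array is why Bash 4 is needed; nodes with only Bash 3 would
need a different way of collecting the key=value pairs.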
Hi Doug,
You could try using auditd to catch the source.
Back when we were still using LSF, we had an issue with one of our
prolog scripts, which killed jobs whenever another job of the same user
was already running on the node. auditd was what identified our own
nodecleaner script as the culprit ;)
Best
Marc
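For reference, a common auditd recipe for this kind of hunt (the rule key
"job_killer" is just an arbitrary tag chosen here) is to log every kill(2)
call that delivers SIGKILL and then search for the offender afterwards:

  # log every 64-bit kill(2) syscall sending signal 9 (SIGKILL)
  auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k job_killer
  # reproduce the problem, then list which process/user issued the kill
  ausearch -k job_killer -i

A second rule with -F arch=b32 catches 32-bit callers, and dropping the a1
filter logs every signal instead of just SIGKILL.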
We have some SLURM jobs for which the .0 process is being killed
by the OS's oom-kill, which makes SLURM come over all wobbly!
After messages akin to:
[.extern] _oom_event_monitor: oom-kill event count: 1
[.0] done with job
what we then see in the slurmd logs are messages of the form:
error: stepd_connect to .1 failed: No such file or directory
error: stepd_connect to .4294967295 failed: No such file or directory
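For cross-checking events like this (generic commands, not from the original
post; <jobid> is a placeholder), the node's kernel log and Slurm's accounting
record usually confirm that the OOM killer fired and which step it hit:

  # kernel OOM-killer messages on the affected node
  dmesg -T | grep -i -e 'out of memory' -e 'killed process'
  # per-step state and peak memory as recorded by Slurm accounting
  sacct -j <jobid> --format=JobID,JobName,State,ExitCode,MaxRSS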