[slurm-users] Call for Abstracts - Slurm User Group Meeting 2018

2018-07-02 Thread Jacob Jenson
The call for abstracts for the 2018 Slurm User Group meeting has been extended until Friday July 6, 2018. You are invited to submit an abstract of a tutorial, technical presentation or site report to be given at the Slurm User Group Meeting 2018. This event is sponsored and organized by CIEMAT an

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-07-02 Thread Matteo Guglielmi
I've reported everything back to the actual sysadmin of the cluster... and the truth behind this story is as unbelievable as the story itself. A savvy cluster user asked a "what is linux?" kind of user to submit 'his' watchdog script to improve the cluster load. Basically you get the f. out of

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-07-02 Thread John Hearns
A great detective story! > June15 but there is no trace of it anywhere on the disk. Do you have the process ID (pid) of the watchdog.sh? You could look in /proc/(pid)/cmdline and see what that shows. On 2 July 2018 at 11:37, Matteo Guglielmi wrote: > Unbelievable... and got it by chance. >
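
A minimal sketch of the /proc/(pid)/cmdline check suggested above, assuming a Linux node with /proc mounted. The helper names `parse_cmdline` and `read_cmdline` are made up for illustration and are not part of Slurm or of the original thread.

```python
from __future__ import annotations

from pathlib import Path


def parse_cmdline(raw: bytes) -> list[str]:
    """Split the NUL-separated contents of a /proc/<pid>/cmdline file.

    The kernel stores the arguments separated (and terminated) by NUL bytes.
    Empty fields are dropped here for readability, so a genuinely empty
    argument would not be shown.
    """
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def read_cmdline(pid: int, proc_root: str = "/proc") -> list[str]:
    """Return the command line of a running process as a list of arguments."""
    return parse_cmdline((Path(proc_root) / str(pid) / "cmdline").read_bytes())
```

Calling `read_cmdline(pid)` with the pid of the watchdog process would show how the script was invoked, even when the script file itself is no longer on disk. Reading another user's entry may need suitable permissions, depending on how /proc is mounted.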

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-07-02 Thread Matteo Guglielmi
Unbelievable... and got it by chance. Jobs were killed (again) at 21:04, and in the user's list of running processes there was a 'sleep 5' command (13 hours + 53 minutes + 20 seconds) which was fired up exactly at the same time. The watchdog.sh script (from which the sleep command is fired) wa