Re: [slurm-users] Slurm Jobscript Archiver

2019-06-14 Thread Christopher Benjamin Coffey
Hi Lech, I'm glad that it is working out well with the modifications you've put in place! Yes, there can be a huge volume of jobscripts out there, and that's a pretty good way of dealing with it. We've backed up 1.1M jobscripts since its inception 1.5 months ago and aren't too worried yet about …

[slurm-users] Pending (Resources) when nodes are available

2019-06-14 Thread Brian Andrus
All, We have a cluster that is using Azure, where nodes are started up as needed. I have encountered an interesting situation where a user ran a loop to launch 100 jobs using srun, each a simple job that just runs an 'id' command for testing. The intention was to have 100 jobs on 100 machines. The partition …
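A minimal sketch of the scenario described above, assuming a cloud partition where idle nodes are powered up on demand; the loop and flags are illustrative:

    #!/bin/bash
    # Launch 100 independent single-task jobs, intended to land one per node.
    # With a cloud partition, later jobs may pend with reason "Resources"
    # until the Azure nodes finish powering up, even though the partition
    # nominally has nodes available.
    for i in $(seq 1 100); do
        srun -N1 -n1 --exclusive id &
    done
    wait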

Re: [slurm-users] Rename account or move user from one account to another

2019-06-14 Thread Sam Gallop (NBI)
Hi Christoph, I suspect that the answer to both of these is no. When I tried to modify an account I got:

$ sudo sacctmgr modify account where name=user1 set account=newaccount1
Can't modify the name of an account

Also, sacctmgr can only reset a user's rawusage, as it only supports a value …
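Since sacctmgr refuses to rename an account, the usual workaround is to create the new account and move the association over. A sketch using the names from the thread's example (oldaccount1 is an assumed name); note that accrued usage stays with the removed association:

    # Create the replacement account and add the user to it.
    sacctmgr add account newaccount1
    sacctmgr add user user1 account=newaccount1
    # Drop the old association, then the old account itself.
    sacctmgr remove user user1 where account=oldaccount1
    sacctmgr remove account oldaccount1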

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Christopher Harrop - NOAA Affiliate
> Hi Chris
>
> You are right in pointing out that the job actually runs, despite the error
> in the sbatch. The customer mentions that:
> === start ===
> The problem had the usual scenario: the job script was submitted and executed,
> but the sbatch command returned a non-zero exit status to ecflow, which thus
> assumed …
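One way a workflow manager like ecflow can cope with this is to double-check whether the job was actually accepted before declaring the submission failed. A sketch, assuming each task is given a unique job name; all names are illustrative:

    #!/bin/bash
    NAME="ecflow_task_$$"              # illustrative unique job name
    sbatch --job-name="$NAME" job.sh
    rc=$?
    if [ "$rc" -ne 0 ]; then
        # sbatch may have timed out waiting for the reply while slurmctld
        # queued the job anyway; look for it before treating this as failure.
        if squeue -h --name="$NAME" -o %i | grep -q .; then
            rc=0
        fi
    fi
    exit $rc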

Re: [slurm-users] ConstrainRAMSpace=yes and page cache?

2019-06-14 Thread Sam Gallop (NBI)
Hi Jürgen, I'm not aware of a Slurm-onic way of doing this. As you've said, this is the behaviour of cgroups, which Slurm is employing. As I understand it, upon allocation the page cache is accounted within the calling process's cgroup, and I'm not aware of a way of preventing the memory resource …
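To see how much of a job's cgroup charge is page cache rather than anonymous memory, the v1 memory controller's memory.stat can be inspected on the compute node. A sketch; the cgroup path layout is the common Slurm one but may differ per configuration, and the job id is illustrative:

    # cgroup v1: "cache" is page cache charged to the job's cgroup,
    # "rss" is anonymous memory. Both count toward the cgroup limit.
    JOBID=12345    # illustrative
    grep -E '^(cache|rss) ' \
        /sys/fs/cgroup/memory/slurm/uid_*/job_${JOBID}/memory.stat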

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-14 Thread Lech Nieroda
Hello Chris, we’ve tried out your archiver and adapted it to our needs; it works quite well. The changes: - we get lots of jobs per day, ca. 3k-5k, so storing them as individual files would waste too many inodes and 4k blocks. Instead, everything is written into two log files (job_script.log and …
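A minimal sketch of the append-to-one-log approach, assuming it runs in a context where scontrol can still see the job (e.g. a slurmctld prolog) and that your Slurm version supports writing the batch script to stdout with '-'; the log path and separator format are illustrative:

    #!/bin/bash
    LOG=/var/log/slurm/job_script.log
    exec 9>>"$LOG"
    flock -x 9      # serialize writers so entries don't interleave
    {
        printf '==== JobId=%s  %s ====\n' "$SLURM_JOB_ID" "$(date -Is)"
        scontrol write batch_script "$SLURM_JOB_ID" -   # '-' = stdout (assumed supported)
        printf '\n'
    } >&9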

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Marcelo Garcia
Hi Chris, You are right in pointing out that the job actually runs, despite the error in the sbatch. The customer mentions that: === start === The problem had the usual scenario: the job script was submitted and executed, but the sbatch command returned a non-zero exit status to ecflow, which thus assumed the job to be …
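The sporadic "Socket timed out on send/recv operation" on submission is often mitigated by giving RPCs more time in slurm.conf. A sketch; the value is an example, not a recommendation:

    # slurm.conf (on the controller and submit hosts)
    # Default is 10 seconds; raising it gives a busy slurmctld more time
    # to answer sbatch before the client gives up.
    MessageTimeout=20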

Re: [slurm-users] ConstrainRAMSpace=yes and page cache?

2019-06-14 Thread Juergen Salk
Dear Kilian, thanks for pointing this out. I should have mentioned that I had already browsed the cgroup.conf man page up and down but did not find any specific hints on how to achieve the desired behavior. Maybe I am still missing something obvious? Also, the kernel cgroups documentation indicates …
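For reference, a sketch of the memory-related cgroup.conf settings the man page covers; as discussed in this thread, none of them excludes page cache from the charge (values are illustrative):

    # cgroup.conf (illustrative values)
    CgroupAutomount=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedRAMSpace=100        # percent of the job's allocated memory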

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Bjørn-Helge Mevik
Christopher Benjamin Coffey writes:

> Hi, you may want to look into increasing the sssd cache length on the
> nodes,

We have thought about that, but it will not solve the problem, only make it less frequent, I think.

> and improving the network connectivity to your ldap
> directory.

That is so …
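For completeness, the client-side caching knob mentioned above lives in sssd.conf; a sketch with an illustrative domain name and value:

    # /etc/sssd/sssd.conf
    [domain/example.com]
    # Keep cached entries longer so fewer lookups hit the LDAP servers
    # (default is 5400 seconds). This reduces, but does not eliminate,
    # failures when the directory is slow to respond.
    entry_cache_timeout = 14400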