Are you using pam_limits.so in any of your /etc/pam.d/ configuration files? That module enforces the limits in /etc/security/limits.conf for every user, and for root those limits are usually unlimited. Root is almost always allowed to do things bad enough to crash the machine or run it out of resources. If the /etc/pam.d/sshd file has pam_limits.so in it, that’s probably where the unlimited setting for root is coming from.
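A quick way to check both things is sketched below. It assumes a POSIX shell and the usual /etc/pam.d layout; the function names pam_limits_files and show_vmem_limits are made up for illustration and are not part of PAM or Slurm.

```shell
# Hypothetical helper: list the files in a PAM config directory
# (default /etc/pam.d) that load pam_limits.so on a non-comment line.
pam_limits_files() {
    dir="${1:-/etc/pam.d}"
    grep -l -E '^[^#]*pam_limits\.so' "$dir"/* 2>/dev/null
}

# Hypothetical helper: print the soft and hard virtual memory limits
# (in kbytes, or "unlimited") of the current shell.
show_vmem_limits() {
    printf 'soft: %s\n' "$(ulimit -S -v)"
    printf 'hard: %s\n' "$(ulimit -H -v)"
}
```

Running pam_limits_files with no argument on the compute node shows which services apply limits.conf, and running show_vmem_limits once as root and once as a regular user over ssh shows whether the two sessions really get different limits.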
Best,
Bill.

--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445

On 4/15/18, 1:26 PM, "slurm-users on behalf of Mahmood Naderan" <slurm-users-boun...@lists.schedmd.com on behalf of mahmood...@gmail.com> wrote:

I have actually disabled the swap partition (!), since the system behaves really badly and, in my experience, I have to enter the room and reset the affected machine (!). Otherwise I have to wait a long time for it to get back to normal.

When I ssh to the node as root, ulimit -a reports unlimited virtual memory. So it seems that root has an unlimited value while users have a limited one.

Regards,
Mahmood

On Sun, Apr 15, 2018 at 10:26 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
> Hi Mahmood,
>
> It seems your compute node is configured with this limit:
>
> virtual memory (kbytes, -v) 72089600
>
> So when the batch job tries to set a higher limit (ulimit -v 82089600) than
> permitted by the system (72089600), this must surely get rejected, as you
> have discovered!
>
> You may want to reconfigure your compute nodes' limits, for example by
> setting the virtual memory limit to "unlimited" in your configuration. If
> the nodes have a very small RAM + swap space size, you might encounter
> Out Of Memory errors...
>
> /Ole