Re: [slurm-users] not allocating the node for job execution even resources are available.

2020-03-31 Thread navin srivastava
In addition to the above problem: OverSubscribe is set to NO, so according to the documentation, even if resources are available it is not accepting the job from the other partition. Even after I set the same priority for both partitions it didn't help. Any suggestions here? Slurm Worklo
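A minimal way to check the effective oversubscription and priority settings on both partitions (a sketch only, assuming scontrol is run on a node with access to the controller; partition names taken from the thread):

    scontrol show partition small_jobs | grep -E 'OverSubscribe|PriorityTier'
    scontrol show partition large_jobs | grep -E 'OverSubscribe|PriorityTier'

With OverSubscribe=NO a CPU is never shared between jobs, so partition priority only orders the queue; jobs in the second partition still have to wait for free cores.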

Re: [slurm-users] Managing Local Scratch/TmpDisk

2020-03-31 Thread Fulcomer, Samuel
If you use cgroups, tmpfs /tmp and /dev/shm usage is counted against the requested memory for the job. On Tue, Mar 31, 2020 at 4:51 PM Ellestad, Erik wrote: > How are folks managing allocation of local TmpDisk for jobs? > > We see how you define the location of TmpFs in slurm.conf. > > And then
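For reference, a minimal sketch of the cgroup settings this behaviour depends on (values are illustrative, not taken from the poster's site):

    # slurm.conf
    TaskPlugin=task/cgroup
    # cgroup.conf
    CgroupAutomount=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    # With ConstrainRAMSpace=yes, pages a job writes to tmpfs mounts such as
    # /tmp or /dev/shm are charged to the job's memory cgroup, so they count
    # against the memory requested with --mem / --mem-per-cpu.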

[slurm-users] Managing Local Scratch/TmpDisk

2020-03-31 Thread Ellestad, Erik
How are folks managing allocation of local TmpDisk for jobs? We see how you define the location of TmpFs in slurm.conf, how the amount per host is defined via TmpDisk, and how a job requests it via srun/sbatch --tmp=X. However, it appears SLURM only checks the defined TmpDisk amount when al
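A minimal sketch of the pieces being described, with hypothetical path, node name and sizes:

    # slurm.conf
    TmpFS=/local/scratch                                        # where TmpDisk space lives on each node
    NodeName=node01 CPUs=32 RealMemory=128000 TmpDisk=800000    # TmpDisk is in MB
    # job submission
    sbatch --tmp=10G job.sh    # only nodes advertising at least 10G of TmpDisk are eligible

As the original post suggests, --tmp is only compared against the node's static TmpDisk value at allocation time; Slurm does not by itself track or limit how much scratch the job actually writes.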

Re: [slurm-users] Job with srun is still RUNNING after node reboot

2020-03-31 Thread David Rhey
Hi, Yair, Out of curiosity have you checked to see if this is a runaway job? David On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom wrote: > Hi, > > We have an issue where running srun (with --pty zsh), and rebooting the > node (from a different shell), the srun reports: > srun: error: eio_message_s
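For context, runaway jobs (jobs still open in the accounting database but no longer known to the controller) can be listed with the following, a sketch assuming accounting is enabled:

    sacctmgr show runawayjobs
    # sacctmgr lists any runaway jobs it finds and offers to fix (close)
    # their database records.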

[slurm-users] Job with srun is still RUNNING after node reboot

2020-03-31 Thread Yair Yarom
Hi, We have an issue where, when running srun (with --pty zsh) and rebooting the node (from a different shell), srun reports: srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]: Zero Bytes were transmitted or received and hangs. After the node boots, Slurm claims that jo
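A few generic commands often used to inspect a job left in this state (the job id is hypothetical; this is not the resolution from the thread):

    squeue -j 12345           # is the job still listed as RUNNING?
    scontrol show job 12345   # which node does the controller think it is on?
    scancel 12345             # ask the controller to cancel it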

[slurm-users] not allocating the node for job execution even resources are available.

2020-03-31 Thread navin srivastava
Hi, I have an issue with resource allocation. The environment has partitions like below: PartitionName=small_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP Shared=YES Priority=8000 PartitionName=large_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP Shared=YES Pri
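For reference, Shared= and Priority= are the older spellings of these partition parameters; a minimal equivalent sketch in the newer syntax (values copied from the post where visible, otherwise left out):

    # slurm.conf, newer-style names: OverSubscribe replaces Shared,
    # PriorityTier replaces the old Priority
    PartitionName=small_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE State=UP OverSubscribe=YES PriorityTier=8000
    # large_jobs is defined the same way on the same nodes; its priority value
    # is cut off in the preview above.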