[slurm-users] Why job memory request may be automatically set by Slurm to RealMemory of some node?

2022-11-04 Thread Taras Shapovalov
Hey,

I noticed weird behavior in Slurm 21 and 22. When the following conditions are
satisfied, Slurm implicitly sets the job's memory request equal to the
RealMemory of some node (perhaps the first node that satisfies the job's other
requests, but this is not documented, or at least I could not find it in the
documentation):
 - RealMemory is specified explicitly on the NodeName line in slurm.conf,
 - no DefMemPerXXX is specified in slurm.conf,
 - the user does not specify a memory request,
 - the cons_tres plugin is configured (see the example below).
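For illustration, here is a minimal setup where I see this (node name and sizes
are just examples):

--
# slurm.conf (example values only):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
NodeName=node01 CPUs=32 RealMemory=128000
# ...and no DefMemPerNode / DefMemPerCPU / DefMemPerGPU anywhere.

# Job submitted with no --mem / --mem-per-cpu at all:
$ sbatch --wrap="hostname"
# "scontrol show job <jobid> | grep -i mem" then reports a memory
# request equal to the full RealMemory of the node.
--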

Should it be set at least to FreeMemory, or even left empty?

Best regards,

Taras


[slurm-users] hierarchies/dependencies between QoSs

2022-11-04 Thread Sebastian Schmutzhard-Höfler

Hi,

is there a way to have hierarchies/dependencies between different QoSs, apart
from preemption?


Is it possible to change the qos of a running job?

We have qos=gpus2, qos=gpus4 and qos=gpus6 (each allowing a certain maximum
total number of GPUs per user). I would like running/pending qos=gpus2 jobs to
be converted to qos=gpus4 jobs when a qos=gpus4 job is submitted, and also,
when qos=gpus4 jobs are already running, I would like new qos=gpus2 jobs to be
submitted automatically as qos=gpus4 jobs.
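Something along these lines is what I have in mind for the first part (QoS
names as above; I am not sure whether Slurm allows changing the QoS of an
already running job, or whether operator/admin rights are required):

--
#!/bin/bash
# Sketch: move all running/pending qos=gpus2 jobs of a user to qos=gpus4.
# Requires sufficient privileges; behaviour for already running jobs untested.
USER_NAME="$1"

# squeue format: %i = job id, %q = QoS
squeue -u "$USER_NAME" -h -o "%i %q" | while read -r jobid qos; do
    if [ "$qos" = "gpus2" ]; then
        scontrol update JobId="$jobid" QOS=gpus4
    fi
done
--

But this would still have to be triggered whenever a qos=gpus4 job is
submitted, and I do not see a clean way to do that.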


Thanks,

Sebastian




[slurm-users] Why every job will sleep 100000000

2022-11-04 Thread GHui
I found a sleep process run by root when I submit a job, and it sleeps for
100000000 seconds.
Sometimes my job hangs: the job state is "R" even though it runs nothing. The
jobscript looks like the following:
--
#!/bin/bash
#SBATCH -J sub
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p vpartition

--

Is it because of the "sleep 100000000" process? Or how can I debug it?

Any help will be appreciated.
--GHui

Re: [slurm-users] Why every job will sleep 100000000

2022-11-04 Thread Jeffrey T Frey
If you examine the process hierarchy, that "sleep 100000000" process is
probably the child of a "slurmstepd: [.extern]" process. This is a
housekeeping step launched for the job by slurmd -- in older Slurm releases it
would handle X11 forwarding, for example. It should have no impact on the
other steps of the job.
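One quick way to confirm this on the compute node (PIDs and job ID will of
course differ):

--
# Find the long-running sleep and look at its parent process:
$ ps -o pid,ppid,args -C sleep
# Then inspect the PPID reported for the "sleep 100000000" line:
$ ps -o pid,args -p <ppid>
# The parent should show up as something like "slurmstepd: [.extern]".
--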



