[slurm-users] Re: SLURM configuration for LDAP users
Hi Richard, Richard Chang via slurm-users writes: > Job submission works for local users. I was not aware we need to manually > add the LDAP users to the SlurmDB. Does it mean we need to add each and every > user in LDAP to the Slurm database ? We add users to the Slurm DB automatically within the job submit plug-in. We use the GID as the default Slurm account. Cheers, Loris > On 2/4/2024 9:04 PM, Renfro, Michael wrote: > > “An LDAP user can login to the login, slurmctld and compute nodes, but when > they try to submit jobs, slurmctld logs an error about invalid account or > partition for user.” > > > > Since I don’t think it was mentioned below, does a non-LDAP user get the > same error, or does it work by default? > > > > We don’t use LDAP explicitly, but we’ve used sssd with Slurm and Active > Directory for 6.5 years without issue. We’ve always added users to sacctmgr so > that we could track usage by research group or class, so we never used a > default account for all users. > > > > From: Richard Chang via slurm-users > Date: Saturday, February 3, 2024 at 11:41 PM > To: slurm-us...@schedmd.com > Subject: [slurm-users] SLURM configuration for LDAP users > > External Email Warning > > This email originated from outside the university. Please use caution when > opening attachments, clicking links, or responding to requests. > > - > > Hi, > > I am a little new to this, so please pardon my ignorance. > > I have configured slurm in my cluster and it works fine with local users. > But I am not able to get it working with LDAP/SSSD authentication. > > User logins using ssh are working fine. An LDAP user can login to the login, > slurmctld and compute nodes, but when they try to submit jobs, slurmctld > logs an error about invalid account or partition for user. > > Someone said we need to add the user manually into the database using the > sacctmgr command. But I am not sure we need to do this for each and every > LDAP user. Yes, it does work if we add the LDAP user manually using > sacctmgr. But I am not convinced this manual way is the way to do. > > The documentation is not very clear about using LDAP accounts. > > Saw somewhere in the list about using UsePAM=1 and copying or creating a > softlink for slurm PAM module under /etc/pam.d . But it didn't work for me. > > Saw somewhere else that we need to specifying > LaunchParameters=enable_nss_slurm in the slurm.conf file and put slurm > keyword in passwd/group > entry in the /etc/nsswitch.conf file. Did these, but didn't help either. > > I am bereft of ideas at present. If anyone has real world experience and can > advise, I will be grateful. > > Thank you, > > Richard -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
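As an illustration of the automatic user creation Loris describes, here is a minimal job_submit.lua sketch. It assumes the plug-in shells out to sacctmgr and names the default account after the submitting user's primary group; this is not necessarily how the FUB plug-in actually does it, and it assumes the account itself already exists in the Slurm DB.

  -- Sketch only: auto-create a missing association at submit time,
  -- assuming sacctmgr is runnable by SlurmUser and the default account
  -- is named after the submitting user's primary group (GID).
  local posix = require("posix")

  function slurm_job_submit(job_desc, part_list, submit_uid)
     local pw = posix.getpasswd(submit_uid)
     local gr = posix.getgroup(pw.gid)
     -- "-i" answers prompts non-interactively; re-adding an existing user is harmless
     os.execute(string.format(
        "sacctmgr -i add user name=%s account=%s >/dev/null 2>&1",
        pw.name, gr.name))
     if job_desc.account == nil then
        job_desc.account = gr.name
     end
     return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
     return slurm.SUCCESS
  end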
[slurm-users] Re: Starting a job after a file is created in previous job (dependency looking for solution)
Hi Amjad, Amjad Syed via slurm-users writes: > Hello > > I have the following scenario: > I need to submit a sequence of up to 400 jobs where the even jobs depend on > the preceding odd job to finish and every odd job depends on the presence of > a > file generated by the preceding even job (availability of the file for the > first of those 400 jobs is guaranteed). > > If I just submit all those jobs via a loop using dependencies, then I end up > with a lot of pending jobs who might later not even run because no output > file has > been produced by the preceding jobs. Is there a way to pause the submission > loop until the required file has been generated so that at most two jobs are > submitted at the same time? > > Here is a sample submission script showing what I want to achieve. > > for i in {1..200}; do > FILE=GHM_paramset_${i}.dat ># How can I pause the submission loop until the FILE has been created > #if test -f "$FILE"; then > jobid4=$(sbatch --parsable --dependency=afterok:$jobid3 job4_sub $i) > jobid3=$(sbatch --parsable --dependency=afterok:$jobid4 job3_sub $i) > #fi > done > > Any help will be appreciated > > Amjad You might find a job array useful for this (for any large number of jobs with identical resources, using a job array also helps backfilling to work efficiently, if you are using it). With a job array you can specify how many jobs should run simultaneously with the '%' notation: --array=1-200%2 Cheers, Loris -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
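A minimal sketch of the job-array approach suggested above; the step commands are placeholders, not Amjad's actual job4_sub/job3_sub scripts, and the time limit is illustrative.

  #!/bin/bash
  #SBATCH --job-name=GHM
  #SBATCH --array=1-200%2        # at most 2 array tasks run at the same time
  #SBATCH --time=04:00:00        # illustrative limit

  FILE=GHM_paramset_${SLURM_ARRAY_TASK_ID}.dat

  # Each task only proceeds if the file from the previous step exists.
  if [[ -f "$FILE" ]]; then
      ./run_even_step "$FILE"    # placeholder for the real command
      ./run_odd_step  "$FILE"    # placeholder for the real command
  fi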
[slurm-users] job_submit.lua - uid in Docker cluster
Hi, Having used https://github.com/giovtorres/slurm-docker-cluster successfully a couple of years ago to develop a job_submit.lua plugin, I am trying to do this again. However, the plugin which works on our current cluster (CentOS 7.9, Slurm 23.02.7) fails in the Docker cluster (Rocky 8.9, Slurm 23.02.7) with the error: job_submit/lua: /etc/slurm/job_submit.lua: /usr/share/lua/5.3/posix/deprecated.lua:453: bad argument #1 to 'getpwuid' (int expected, got number) which occurs when submit_user = posix.getpasswd(submit_user.uid) The problem is that 'submit_user.uid' has the value 0.0 and is thus not an integer. The only user within the Docker cluster is 'root'. Has anyone come across this issue? Is it to do with the Docker environment or the difference in the OS versions (Lua 5.1.4 vs. 5.3.4, lua-posix 32 vs. 33.3.1)? Cheers, Loris -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
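One possible workaround, assuming the cause is Lua 5.3's stricter integer/float distinction rather than Slurm itself, is to convert the uid explicitly before calling into luaposix:

  -- uid arrives as a float (0.0) under Lua 5.3, so convert it explicitly
  local uid = math.tointeger(submit_user.uid) or math.floor(submit_user.uid)
  submit_user = posix.getpasswd(uid)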
[slurm-users] Re: Suggestions for Partition/QoS configuration
Hi Thomas, "thomas.hartmann--- via slurm-users" writes: > Hi, > we're testing possible slurm configurations on a test system right now. > Eventually, it is going to serve ~1000 users. > > We're going to have some users who are going to run lots of short jobs > (a couple of minutes to ~4h) and some users that run jobs that are > going to run for days or weeks. I want to avoid a situation in which a > group of users basically saturates the whole cluster with jobs that > run for a week or two and nobody could run any short jobs anymore. I > also would like to favor short jobs, because they make the whole > cluster feel more dynamic and agile for everybody. > > On the other hand, I would like to make the most of the ressources, > i.e. when nobody is sending short jobs, long jobs could run on all the > nodes. > > My idea was to basically have three partitions: > > 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99] > PriorityTier=100 > 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] > PriorityTier=100 > 3. PartitionName=long_preempt MaxTime=14-00:00:00 State=UP Nodes=nodes[01-99] > PriorityTier=40 PreemptMode=requeue > > and then use the JobSubmitPlugin "all_partitions" so that all jobs get > submitted to all partitions by default. This way, a short job ends up > in the `short` partition and is able to use all nodes. A long job ends > up using the `long_safe` partition until for the first 50 nodes. These > jobs are not going to be preempted. Remaining long jobs use the > `long_preempt` queue. So they run on the remaining nodes as long as > there are no higher prio short (or long) jobs in the queue. > > So, the cluster could be saturated with long running jobs but if short > jobs are submitted and the user has a high enough fair share, some of > the long jobs would get preempted and the short ones would run. > > This scenario works fine BUT the long jobs seem to be playing > pingpong on the `long_preempt` partition because as soon as they run, > they stop accruing AGE priority unlike still queued jobs. As soon as a > queued job, albeit by the same user, "overtakes" a running one, it > preempts the running one, stops accruing age and so on > > So, is there maybe a cleverer way to do this? > > Thanks a lot! > Thomas I have never really understood the approach of having different partitions for different lengths of job, but it seems to be quite widespread, so I assume there are valid use cases. However, for our around 450 users, of which about 200 will submit at least one job in a given month, we have an alternative approach without pre-emption where we essentially have just a single partition. Users can then specify a QOS which will increase priority at the cost of accepting a lower cap on number of jobs/resources/maximum runtime:

  $ sqos
      Name   Priority     MaxWall  MaxJobs  MaxSubmit            MaxTRESPU
  --------- ---------- ----------- -------- ---------- --------------------
     hiprio        100     3:00:00       50        100   cpu=128,gres/gpu=4
       prio       1000  3-00:00:00      500       1000   cpu=256,gres/gpu=8
   standard          0 14-00:00:00     2000          1  cpu=768,gres/gpu=16

where

  alias sqos='sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20'
  /usr/bin/sacctmgr

The standard cap on the resources corresponds to about 1/7 of our cores. The downside is that very occasionally nodes may idle because a user has reached his or her cap. However, we usually have enough uncapped users submitting jobs, so that in fact this happens only rarely, such as sometimes at Christmas or New Year. Cheers, Loris -- Dr.
Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
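For anyone wanting to reproduce a similar setup, QOS entries of this kind could be created with sacctmgr roughly as follows (a sketch, not necessarily the commands actually used at FUB; option names taken from the sacctmgr man page, values from the table above):

  # Sketch only:
  sacctmgr add qos hiprio
  sacctmgr modify qos hiprio set Priority=100 MaxWall=03:00:00 \
      MaxJobsPU=50 MaxSubmitJobsPU=100 MaxTRESPU=cpu=128,gres/gpu=4

  # Users then opt in per job:
  sbatch --qos=hiprio job_script.sh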
[slurm-users] Re: Avoiding fragmentation
Hi Gerhard, Gerhard Strangar via slurm-users writes: > Hi, > > I'm trying to figure out how to deal with a mix of few- and many-cpu > jobs. By that I mean most jobs use 128 cpus, but sometimes there are > jobs with only 16. As soon as that job with only 16 is running, the > scheduler splits the next 128 cpu jobs into 96+16 each, instead of > assigning a full 128 cpu node to them. Is there a way for the > administrator to achieve preferring full nodes? > The existence of pack_serial_at_end makes me believe there is not, > because that basically is what I needed, apart from my serial jobs using > 16 cpus instead of 1. > > Gerhard This may well not be relevant for your case, but we actively discourage the use of full nodes for the following reasons: - When the cluster is full, which is most of the time, MPI jobs in general will start much faster if they don't specify the number of nodes and certainly don't request full nodes. The overhead due to the jobs being scattered across nodes is often much lower than the additional waiting time incurred by requesting whole nodes. - When all the cores of a node are requested, all the memory of the node becomes unavailable to other jobs, regardless of how much memory is requested or indeed how much is actually used. This holds up jobs with low CPU but high memory requirements and thus reduces the total throughput of the system. These factors are important for us because we have a large number of single core jobs and almost all the users, whether doing MPI or not, significantly overestimate the memory requirements of their jobs. Cheers, Loris -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: scheduling according time requirements
Hi Dietmar, Dietmar Rieder via slurm-users writes: > Hi, > > is it possible to have slurm scheduling jobs automatical according to > the "-t" time requirements to a fitting partition? > > e.g. 3 partitions > > PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00 > DefaultTime=00:10:00 State=UP OverSubscribe=NO > PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00 > DefaultTime=04:00:00 State=UP OverSubscribe=NO > PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00 > DefaultTime=24:00:00 State=UP OverSubscribe=NO > > > So in the standard partition which is the default we have all nodes > and a max time of 4h, in the medium partition we have 4 nodes with a > max time of 24h and in the long partition we have 2 nodes with a max > time of 336h. > > I was hoping that if I submit a job with -t 01:00:00 it can be run on > any node (standard partition), whereas when specifying -t 05:00:00 or > -t 48:00:00 the job will run on the nodes of the medium or long > partition respectively. > > However, my job will not get scheduled at all when -t is greater than > 01:00:00 > > i.e. > > ]$ srun --cpus-per-task 1 -t 01:00:01 --pty bash > srun: Requested partition configuration not available now > srun: job 42095 queued and waiting for resources > > it will wait forever because the standard partition is selected, I was > thinking that slurm would automatically switch to the medium > partition. > > Do I misunderstand something there? Or can this be somehow configured. You can specify multiple partitions, e.g. $ salloc --cpus-per-task=1 --time=01:00:01 --partition=standard,medium,long Notice that rather than using 'srun ... --pty bash', as far as I understand, the preferred method is to use 'salloc' as above, and to use 'srun' for starting MPI processes. Cheers, Loris > Thanks so much and sorry for the naive question >Dietmar -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
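If the aim is to avoid typing the partition list on every submission, a job_submit.lua fragment along these lines (a sketch, using the partition names from Dietmar's example) could supply it as a default; Slurm should then only start the job in a partition whose MaxTime is long enough:

  -- Sketch: let jobs without an explicit partition be considered for all
  -- three partitions.
  if job_desc.partition == nil then
     job_desc.partition = "standard,medium,long"
  end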
[slurm-users] Re: [EXTERN] Re: scheduling according time requirements
Hi Dietmar, Dietmar Rieder via slurm-users writes: > Hi Loris, > > On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote: >> Hi Dietmar, >> Dietmar Rieder via slurm-users >> writes: >> >>> Hi, >>> >>> is it possible to have slurm scheduling jobs automatical according to >>> the "-t" time requirements to a fitting partition? >>> >>> e.g. 3 partitions >>> >>> PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00 >>> DefaultTime=00:10:00 State=UP OverSubscribe=NO >>> PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00 >>> DefaultTime=04:00:00 State=UP OverSubscribe=NO >>> PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00 >>> DefaultTime=24:00:00 State=UP OverSubscribe=NO >>> >>> >>> So in the standard partition which is the default we have all nodes >>> and a max time of 4h, in the medium partition we have 4 nodes with a >>> max time of 24h and in the long partition we have 2 nodes with a max >>> time of 336h. >>> >>> I was hoping that if I submit a job with -t 01:00:00 it can be run on >>> any node (standard partition), whereas when specifying -t 05:00:00 or >>> -t 48:00:00 the job will run on the nodes of the medium or long >>> partition respectively. >>> >>> However, my job will not get scheduled at all when -t is greater than >>> 01:00:00 >>> >>> i.e. >>> >>> ]$ srun --cpus-per-task 1 -t 01:00:01 --pty bash >>> srun: Requested partition configuration not available now >>> srun: job 42095 queued and waiting for resources >>> >>> it will wait forever because the standard partition is selected, I was >>> thinking that slurm would automatically switch to the medium >>> partition. >>> >>> Do I misunderstand something there? Or can this be somehow configured. >> You can specify multiple partitions, e.g. >> $ salloc --cpus-per-task=1 --time=01:00:01 >> --partition=standard,medium,long >> Notice that rather than using 'srun ... --pty bash', as far as I >> understand, the preferred method is to use 'salloc' as above, and to use >> 'srun' for starting MPI processes. > > Thanks for the hint. This works nicely, but it would be nice that I > would not need to specify the partition at all. Any thoughts? I am not aware that you can set multiple partitions as a default. The question is why you actually need partitions with different maximum runtimes. In our case, a university cluster with a very wide range of codes and usage patterns, multiple partitions would probably lead to fragmentation and wastage of resources due to the job mix not always fitting well to the various partitions. Therefore, I am a member of the "as few partitions as possible" camp and so in our set-up we have essentially only one partition with a DefaultTime of 14 days. We do however let users set a QOS to gain a priority boost in return for accepting a shorter run-time and a reduced maximum number of cores. Occasionally people complain about short jobs having to wait in the queue for too long, but I have generally been successful in solving the problem by having them estimate their resource requirements better or bundling their work in order to increase the run-time-to-wait-time ratio. Cheers, Loris -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: [EXTERN] Re: scheduling according time requirements
Hi Dietmar, Dietmar Rieder via slurm-users writes: > Hi Loris, > > On 4/30/24 3:43 PM, Loris Bennett via slurm-users wrote: >> Hi Dietmar, >> Dietmar Rieder via slurm-users >> writes: >> >>> Hi Loris, >>> >>> On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote: >>>> Hi Dietmar, >>>> Dietmar Rieder via slurm-users >>>> writes: >>>> >>>>> Hi, >>>>> >>>>> is it possible to have slurm scheduling jobs automatical according to >>>>> the "-t" time requirements to a fitting partition? >>>>> >>>>> e.g. 3 partitions >>>>> >>>>> PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00 >>>>> DefaultTime=00:10:00 State=UP OverSubscribe=NO >>>>> PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00 >>>>> DefaultTime=04:00:00 State=UP OverSubscribe=NO >>>>> PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00 >>>>> DefaultTime=24:00:00 State=UP OverSubscribe=NO >>>>> >>>>> >>>>> So in the standard partition which is the default we have all nodes >>>>> and a max time of 4h, in the medium partition we have 4 nodes with a >>>>> max time of 24h and in the long partition we have 2 nodes with a max >>>>> time of 336h. >>>>> >>>>> I was hoping that if I submit a job with -t 01:00:00 it can be run on >>>>> any node (standard partition), whereas when specifying -t 05:00:00 or >>>>> -t 48:00:00 the job will run on the nodes of the medium or long >>>>> partition respectively. >>>>> >>>>> However, my job will not get scheduled at all when -t is greater than >>>>> 01:00:00 >>>>> >>>>> i.e. >>>>> >>>>> ]$ srun --cpus-per-task 1 -t 01:00:01 --pty bash >>>>> srun: Requested partition configuration not available now >>>>> srun: job 42095 queued and waiting for resources >>>>> >>>>> it will wait forever because the standard partition is selected, I was >>>>> thinking that slurm would automatically switch to the medium >>>>> partition. >>>>> >>>>> Do I misunderstand something there? Or can this be somehow configured. >>>> You can specify multiple partitions, e.g. >>>> $ salloc --cpus-per-task=1 --time=01:00:01 >>>> --partition=standard,medium,long >>>> Notice that rather than using 'srun ... --pty bash', as far as I >>>> understand, the preferred method is to use 'salloc' as above, and to use >>>> 'srun' for starting MPI processes. >>> >>> Thanks for the hint. This works nicely, but it would be nice that I >>> would not need to specify the partition at all. Any thoughts? >> I am not aware that you can set multiple partition as a default. > > Diego suggested a possible way which seems to work after a quick test. Yes, I wasn't aware of that, but it might also be useful for us, too. >> The question is why you actually need partitions with different >> maximum >> runtimes. > > we would like to have only a sub set of the nodes in a partition for > long running jobs, so that there are enough nodes available for short > jobs. > > The nodes for the long partition, however are also part of the short > partition so they can also be utilized when no long jobs are running. > > That's our idea If you have plenty of short running jobs, that is probably a reasonable approach. On our system, the number of short running jobs would probably tend to dip significantly over the weekend and public holidays, so resources would potentially be blocked for the long running jobs. On the other hand, long-running jobs on our system often run for days, so one day here or there might not be so significant. And if the long-running jobs were able to start in the short partition, they could block short jobs. 
The other thing to think about with regard to short jobs is backfilling. With our mix of jobs, unless a job needs a large amount of memory or number of cores, those with a run-time of only a few hours should be backfilled fairly efficiently. Regards Loris >> In our case, a university cluster with a very wide range of codes >> and >> usage patterns, multiple partitions would probably lead to fragmentation >> and wastage of resources due to the job mix not always fitting well to >> the various partitions. Therefore, I am a member of the "as few >> partitions as possible" camp and so in our set-up we have as essentially >> only one partition with a DefaultTime of 14 days. We do however let >> users set a QOS to gain a priority boost in return for accepting a >> shorter run-time and a reduced maximum number of cores. > > we didn't look into QOS yet, but this might also a way to go, thanks. > >> Occasionally people complain about short jobs having to wait in the >> queue for too long, but I have generally been successful in solving the >> problem by having them estimate their resource requirements better or >> bundling their work in ordert to increase the run-time-to-wait-time >> ratio. >> > > Dietmar -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: GPU GRES verification and some really broad questions.
Hi, Shooktija S N via slurm-users writes: > Hi, > > I am a complete slurm-admin and sys-admin noob trying to set up a 3 node > Slurm cluster. I have managed to get a minimum working example running, in > which I am able to use a GPU (NVIDIA GeForce RTX 4070 ti) as a GRES. > > This is slurm.conf without the comment lines: > root@server1:/etc/slurm# grep -v "#" slurm.conf > ClusterName=DlabCluster > SlurmctldHost=server1 > GresTypes=gpu > ProctrackType=proctrack/linuxproc > ReturnToService=1 > SlurmctldPidFile=/var/run/slurmctld.pid > SlurmctldPort=6817 > SlurmdPidFile=/var/run/slurmd.pid > SlurmdPort=6818 > SlurmdSpoolDir=/var/spool/slurmd > SlurmUser=root > StateSaveLocation=/var/spool/slurmctld > TaskPlugin=task/affinity,task/cgroup > InactiveLimit=0 > KillWait=30 > MinJobAge=300 > SlurmctldTimeout=120 > SlurmdTimeout=300 > Waittime=0 > SchedulerType=sched/backfill > SelectType=select/cons_tres > JobCompType=jobcomp/none > JobAcctGatherFrequency=30 > SlurmctldDebug=info > SlurmctldLogFile=/var/log/slurmctld.log > SlurmdDebug=debug3 > SlurmdLogFile=/var/log/slurmd.log > NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 > ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1 > PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP > This is gres.conf (only one line), each node has been assigned its > corresponding NodeName: > root@server1:/etc/slurm# cat gres.conf > NodeName=server1 Name=gpu File=/dev/nvidia0 > Those are the only config files I have. > > I have a few general questions, loosely arranged in ascending order of > generality: > > 1) I have enabled the allocation of GPU resources as a GRES and have tested > this by running: > shookti@server1:~$ srun --nodes=3 --gpus=3 --label hostname > 2: server3 > 0: server1 > 1: server2 > Is this a good way to check if the configs have worked correctly? How else > can I easily check if the GPU GRES has been properly configured? What do you mean by 'properly configured'? Ultimately you will want to submit a job to the nodes and use something like 'nvidia-smi' to see whether the GPUs are actually being used. > 2) I want to reserve a few CPU cores, and a few gigs of memory for use by non > slurm related tasks. According to the documentation, I am to use > CoreSpecCount and MemSpecLimit to achieve this. The documentation for > CoreSpecCount says "the Slurm daemon slurmd may either be confined to these > resources (the default) or prevented from using these resources", how do I > change this default behaviour to have the config specify the cores reserved > for non > slurm stuff instead of specifying how many cores slurm can use? I am not aware that this is possible. > 3) While looking up examples online on how to run Python scripts inside a > conda env, I have seen that the line 'module load conda' should be run before > running 'conda activate myEnv' in the sbatch submission script. The command > 'module' did not exist until I installed the apt package > 'environment-modules', > but now I see that conda is not listed as a module that can be loaded when I > check using the command 'module avail'. How do I fix this? Environment modules and Conda are somewhat orthogonal to each other. Environment modules is a mechanism for manipulating environment variables such as PATH and LD_LIBRARY_PATH. It allows you to provide easy access for all users to software which has been centrally installed in non-standard paths. It is not used to provide access to software installed via 'apt'. 
Conda is another approach to providing non-standard software, but is usually used by individual users to install programs in their own home directories. You can use environment modules to allow access to a different version of Conda than the one you get via 'apt', but there is no necessity to do that. > 4) A very broad question: while managing the resources being used by a > program, slurm might happen to split the resources across multiple computers > that > might not necessarily have the files required by this program to run. For > example, a python script that requires the package 'numpy' to function but > that > package was not installed on all of the computers. How are such things dealt > with? Is the module approach meant to fix this problem? In my previous > question, if I had a python script that users usually run just by running a > command like 'python3 someScript.py' instead of running it within a conda > environment, how should I enable slurm to manage the resources required by > this script? Would I have to install all the packages required by this script > on all > the computers that are in the cluster? In general a distributed or cluster file system, such as NFS, Ceph or Lustre, is used to provide access to multiple nodes. /home would be on such a file system, as would a large part of the software. You can use something like EasyBuild which will install software and generate the relevant module files. > 5) Related
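A small test job along the lines Loris suggests earlier in this mail, to confirm that a GPU is actually allocated and visible to the job (partition and paths omitted; nvidia-smi is assumed to be installed on the compute nodes):

  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --gres=gpu:1
  #SBATCH --time=00:05:00

  # Slurm should restrict the job to the allocated device(s) only
  echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
  nvidia-smi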
[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED
Ryan Novosielski via slurm-users writes: > We do have bf_continue set. And also bf_max_job_user=50, because we > discovered that one user can submit so many jobs that it will hit the limit > of the number > it’s going to consider and not run some jobs that it could otherwise run. > > On Jun 4, 2024, at 16:20, Robert Kudyba wrote: > > Thanks for the quick response Ryan! > > Are there any recommendations for bf_ options from > https://slurm.schedmd.com/sched_config.html that could help with this? > bf_continue? Decreasing > bf_interval= to a value lower than 30? Your bf_window may be too small. From 'man slurm.conf': bf_window=# The number of minutes into the future to look when considering jobs to schedule. Higher values result in more overhead and less responsiveness. A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation. In order to limit the amount of data managed by the backfill scheduler, if the value of bf_window is increased, then it is generally advisable to also increase bf_resolution. This option applies only to SchedulerType=sched/backfill. Default: 1440 (1 day), Min: 1, Max: 43200 (30 days). > On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski wrote: > > This is relatively true of my system as well, and I believe it’s that the > backfill schedule is slower than the main scheduler. > > -- > #BlackLivesMatter > > || \\UTGERS, |---*O*--- > ||_// the State | Ryan Novosielski - novos...@rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > || \\of NJ | Office of Advanced Research Computing - MSB A555B, Newark > `' > > On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users > wrote: > > At the moment we have 2 nodes that are having long wait times. Generally > this is when the nodes are fully allocated. What would be the other > reasons if there is still enough available memory and CPU available, that a > job would take so long? Slurm version is 23.02.4 via Bright > Computing. Note the compute nodes have hyperthreading enabled but that > should be irrelevant. Is there a way to determine what else could > be holding jobs up? > > srun --pty -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short > /bin/bash > srun: job 672204 queued and waiting for resources > > scontrol show node node001 > NodeName=m001 Arch=x86_64 CoresPerSocket=48 > CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37 > AvailableFeatures=location=local > ActiveFeatures=location=local > Gres=gpu:A6000:8 > NodeAddr=node001 NodeHostName=node001 Version=23.02.4 > OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38 > EDT 2022 > RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1 > State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A > Partitions=ours,short > BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11 > LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None > CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8 > AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2 > CapWatts=n/a > CurrentWatts=0 AveWatts=0 > ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s > > grep 672204 /var/log/slurmctld > [2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204 > NodeList=(null) usec=852 > > -- > slurm-users mailing list -- slurm-users@lists.schedmd.com > To unsubscribe send an email to slurm-users-le...@lists.schedmd.com -- Dr. 
Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
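For reference, these settings live in slurm.conf; the values below are purely illustrative (a 14-day window with a coarser resolution, plus the options already mentioned in the thread), not a recommendation for the cluster discussed above:

  SchedulerType=sched/backfill
  SchedulerParameters=bf_continue,bf_max_job_user=50,bf_window=20160,bf_resolution=600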
[slurm-users] Re: sbatch: Node count specification invalid - when only specifying --ntasks
Hi George, George Leaver via slurm-users writes: > Hello, > > Previously we were running 22.05.10 and could submit a "multinode" job > using only the total number of cores to run, not the number of nodes. > For example, in a cluster containing only 40-core nodes (no > hyperthreading), Slurm would determine two nodes were needed with > only: > sbatch -p multinode -n 80 --wrap="" > > Now in 23.02.1 this is no longer the case - we get: > sbatch: error: Batch job submission failed: Node count specification invalid > > At least -N 2 is must be used (-n 80 can be added) > sbatch -p multinode -N 2 -n 80 --wrap="" > > The partition config was, and is, as follows (MinNodes=2 to reject > small jobs submitted to this partition - we want at least two nodes > requested) > PartitionName=multinode State=UP Nodes=node[081-245] > DefaultTime=168:00:00 MaxTime=168:00:00 PreemptMode=OFF PriorityTier=1 > DefMemPerCPU=4096 MinNodes=2 QOS=multinode Oversubscribe=EXCLUSIVE > Default=NO But do you really want to force a job to use two nodes if it could in fact run on one? What is the use-case for having separate 'uninode' and 'multinode' partitions? We have a university cluster with a very wide range of jobs and essentially a single partition. Allowing all job types to use one partition means that the different resource requirements tend to complement each other to some degree. Doesn't splitting up your jobs over two partitions mean that either one of the two partitions could be full, while the other has idle nodes? Cheers, Loris > All nodes are of the form > NodeName=node245 NodeAddr=node245 State=UNKNOWN Procs=40 Sockets=2 > CoresPerSocket=20 ThreadsPerCore=1 RealMemory=187000 > > slurm.conf has > EnforcePartLimits = ANY > SelectType = select/cons_tres > TaskPlugin = task/cgroup,task/affinity > > A few fields from: sacctmgr show qos multinode > Name|Flags|MaxTRES > multinode|DenyOnLimit|node=5 > > The sbatch/srun man page states: > -n, --ntasks If -N is not specified, the default behavior is to > allocate enough nodes to satisfy the requested resources as expressed > by per-job specification options, e.g. -n, -c and --gpus. > > I've had a look through release notes back to 22.05.10 but can't see anything > obvious (to me). > > Has this behaviour changed? Or, more likely, what have I missed ;-) ? > > Many thanks, > George > > -- > George Leaver > Research Infrastructure, IT Services, University of Manchester > http://ri.itservices.manchester.ac.uk | @UoM_eResearch -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: sbatch: Node count specification invalid - when only specifying --ntasks
Hi George, George Leaver via slurm-users writes: > Hi Loris, > >> Doesn't splitting up your jobs over two partitions mean that either >> one of the two partitions could be full, while the other has idle >> nodes? > > Yes, potentially, and we may move away from our current config at some > point (it's a bit of a hangover from an SGE cluster.) Hasn't really > been an issue at the moment. > > Do you find fragmentation a problem? Or do you just let the bf scheduler > handle that (assuming jobs have a realistic wallclock request?) Well, no: with essentially only one partition we don't have fragmentation related to that. When we did have multiple partitions for different run-times, we did have fragmentation. However, I couldn't see any advantage in that setup, so we moved to one partition and various QOS to handle, say, test or debug jobs. However, users do still sometimes add potentially arbitrary conditions to their job scripts, such as the number of nodes for MPI jobs. Whereas in principle it may be a good idea to reduce the MPI-overhead by reducing the number of nodes, in practice any such advantage may well be cancelled out or exceeded by the extra time the job is going to have to wait for those specific resources. Backfill works fairly well for us, although indeed not without a little badgering of users to get them to specify appropriate run-times. > But for now, would be handy if users didn't need to adjust their jobscripts > (or we didn't need to write a submit script.) If you ditch one of the partitions, you could always use a job submit plug-in to replace the invalid partition specified by the job with the remaining partition. Cheers, Loris > Regards, > George > > -- > George Leaver > Research Infrastructure, IT Services, University of Manchester > http://ri.itservices.manchester.ac.uk | @UoM_eResearch > > > -- > slurm-users mailing list -- slurm-users@lists.schedmd.com > To unsubscribe send an email to slurm-users-le...@lists.schedmd.com -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
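A sketch of the job-submit rewrite Loris mentions; the name of the retired partition ('serial') is hypothetical and used purely for illustration:

  -- job_submit.lua sketch: silently redirect jobs that still name the
  -- retired partition (partition names are hypothetical).
  function slurm_job_submit(job_desc, part_list, submit_uid)
     if job_desc.partition == "serial" then
        job_desc.partition = "multinode"
     end
     return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
     return slurm.SUCCESS
  end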
[slurm-users] Re: How to exclude master from computing? Set to DRAINED?
Hi Xaver, Xaver Stiensmeier via slurm-users writes: > Dear Slurm users, > > in our project we exclude the master from computing before starting > Slurmctld. We used to exclude the master from computing by simply not > mentioning it in > the configuration i.e. just not having: > > PartitionName=SomePartition Nodes=master > > or something similar. Apparently, this is not the way to do this as it is now > a fatal error > > fatal: Unable to determine this slurmd's NodeName > > therefore, my question: > > What is the best practice for excluding the master node from work? > > I personally primarily see the option to set the node into DOWN, DRAINED or > RESERVED. Since we use ReturnToService=2, I guess DOWN is not the way to go. > RESERVED fits with the second part "The node is in an advanced reservation > and not generally available." and DRAINED "The node is unavailable for use per > system administrator request." fits completely. So is DRAINED the correct > setting in such a case? You just don't configure the head node in any partition. You are getting the error because you are starting 'slurmd' on the node, which implies you do want to run jobs there. Normally you would run only 'slurmctld' and possibly also 'slurmdbd' on your head node. Cheers, Loris -- Dr. Loris Bennett (Herr/Mr) FUB-IT (ex-ZEDAT), Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
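A minimal slurm.conf sketch of the layout Loris describes (hostnames and sizes are illustrative): the head node appears only as the controller, is not defined as a NodeName, is in no partition, and does not run slurmd.

  SlurmctldHost=master
  NodeName=worker[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
  PartitionName=SomePartition Nodes=worker[01-04] Default=YES MaxTime=INFINITE State=UP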
[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node
Dear Arko, Arko Roy via slurm-users writes: > I want to run 50 sequential jobs (essentially 50 copies of the same code with > different input parameters) on a particular node. However, as soon as one of > the > jobs gets executed, the other 49 jobs get killed immediately with exit code > 9. The jobs are not interacting and are strictly parallel. However, if the > 50 jobs run on > 50 different nodes, it runs successfully. > Can anyone please help with possible fixes? > I see a discussion almost along the similar lines in > https://groups.google.com/g/slurm-users/c/I1T6GWcLjt4 > But could not get the final solution. If the jobs are independent, why do you want to run them all on the same node? If you do have problems when jobs run on the same node, there may be an issue with the jobs all trying to access a single resource, such as a file. However, you probably need to show your job script in order for anyone to be able to work out what is going on. Regards Loris > -- > Arko Roy > Assistant Professor > School of Physical Sciences > Indian Institute of Technology Mandi > Kamand, Mandi > Himachal Pradesh - 175 005, India > Email: a...@iitmandi.ac.in > Web: https://faculty.iitmandi.ac.in/~arko/ -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node
Dear Arko, Arko Roy writes: > Thanks Loris and Gareth. here is the job submission script. if you find any > errors please let me know. > since i am not the admin but just an user, i think i dont have access to the > prolog and epilogue files. > > If the jobs are independent, why do you want to run them all on the same > node? > I am running sequential codes. Essentially 50 copies of the same node with a > variation in parameter. > Since I am using the Slurm scheduler, the nodes and cores are allocated > depending upon the > available resources. So there are instances, when 20 of them goes to 20 free > cores located on a particular > node and the rest 30 goes to the free 30 cores on another node. It turns out > that only 1 job out of 20 and 1 job > out of 30 are completed succesfully with exitcode 0 and the rest gets > terminated with exitcode 9. > for information, i run sjobexitmod -l jobid to check the exitcodes. > > -- > the submission script is as follows: > > #!/bin/bash > > # Setting slurm options > > > # lines starting with "#SBATCH" define your jobs parameters > # requesting the type of node on which to run job > ##SBATCH --partition > #SBATCH --partition=standard > > # telling slurm how many instances of this job to spawn (typically 1) > > ##SBATCH --ntasks > ##SBATCH --ntasks=1 > #SBATCH --nodes=1 > ##SBATCH -N 1 > ##SBATCH --ntasks-per-node=1 > > # setting number of CPUs per task (1 for serial jobs) > > ##SBATCH --cpus-per-task > > ##SBATCH --cpus-per-task=1 > > # setting memory requirements > > ##SBATCH --mem-per-cpu > #SBATCH --mem-per-cpu=1G > > # propagating max time for job to run > > ##SBATCH --time > ##SBATCH --time > ##SBATCH --time > #SBATCH --time 10:0:0 > #SBATCH --job-name gstate > > #module load compiler/intel/2018_4 > module load fftw-3.3.10-intel-2021.6.0-ppbepka > echo "Running on $(hostname)" > echo "We are in $(pwd)" > > > # run the program > > /home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out & You should not write & at the end of the above command. This will run your program in the background, which will cause the submit script to terminate, which in turn will terminate your job. Regards Loris -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
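Two possible variants of the last line of the script, depending on what is wanted: run the program in the foreground, or, if several copies really are to be started from within one job, hold the script open with 'wait' until they finish (the second path is hypothetical, for illustration only):

  # Variant 1: run in the foreground; the job lives as long as the program.
  /home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out

  # Variant 2: several background processes, held open with 'wait'.
  /home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &
  /home/arkoroy.sps.iitmandi/ferro-detun/input2/a_1.out &
  wait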
[slurm-users] salloc not starting shell despite LaunchParameters=use_interactive_step
Hi, With $ salloc --version slurm 23.11.10 and $ grep LaunchParameters /etc/slurm/slurm.conf LaunchParameters=use_interactive_step the following $ salloc --partition=interactive --ntasks=1 --time=00:03:00 --mem=1000 --qos=standard salloc: Granted job allocation 18928869 salloc: Nodes c001 are ready for job creates a job $ squeue --me JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 18928779 interacti interactloris R 1:05 1 c001 but causes the terminal to block. From a second terminal I can log into the compute node: $ ssh c001 [13:39:36] loris@c001 (1000) ~ Is that the expected behaviour or should salloc return a shell directly on the compute node (like srun --pty /bin/bash -l used to do)? Cheers, Loris -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: salloc not starting shell despite LaunchParameters=use_interactive_step
Jason Simms via slurm-users writes: > Ours works fine, however, without the InteractiveStepOptions parameter. My assumption is also that default value should be OK. It would be nice if some one could confirm that 23.11.10 was working for them. However, we'll probably be upgrading to 24.5 fairly soon, and so we shall see whether the issue persists. Cheers, Loris > JLS > > On Thu, Sep 5, 2024 at 9:53 AM Carsten Beyer via slurm-users > wrote: > > Hi Loris, > > we use SLURM 23.02.7 (Production) and 23.11.1 (Testsystem). Our config > contains a second parameter InteractiveStepOptions in slurm.conf: > > InteractiveStepOptions="--interactive --preserve-env --pty $SHELL -l" > LaunchParameters=enable_nss_slurm,use_interactive_step > > That works fine for us: > > [k202068@levantetest ~]$ salloc -N1 -A k20200 -p compute > salloc: Pending job allocation 857 > salloc: job 857 queued and waiting for resources > salloc: job 857 has been allocated resources > salloc: Granted job allocation 857 > salloc: Waiting for resource configuration > salloc: Nodes lt1 are ready for job > [k202068@lt10000 ~]$ > > Best Regards, > Carsten > > Am 05.09.24 um 14:17 schrieb Loris Bennett via slurm-users: > > Hi, > > > > With > > > >$ salloc --version > >slurm 23.11.10 > > > > and > > > >$ grep LaunchParameters /etc/slurm/slurm.conf > >LaunchParameters=use_interactive_step > > > > the following > > > >$ salloc --partition=interactive --ntasks=1 --time=00:03:00 --mem=1000 > --qos=standard > >salloc: Granted job allocation 18928869 > >salloc: Nodes c001 are ready for job > > > > creates a job > > > >$ squeue --me > > JOBID PARTITION NAME USER ST TIME NODES > NODELIST(REASON) > > 18928779 interacti interactloris R 1:05 1 c001 > > > > but causes the terminal to block. > > > > From a second terminal I can log into the compute node: > > > >$ ssh c001 > >[13:39:36] loris@c001 (1000) ~ > > > > Is that the expected behaviour or should salloc return a shell directly > > on the compute node (like srun --pty /bin/bash -l used to do)? > > > > Cheers, > > > > Loris > > > -- > Carsten Beyer > Abteilung Systeme > > Deutsches Klimarechenzentrum GmbH (DKRZ) > Bundesstraße 45a * D-20146 Hamburg * Germany > > Phone: +49 40 460094-221 > Fax:+49 40 460094-270 > Email: be...@dkrz.de > URL:http://www.dkrz.de > > Geschäftsführer: Prof. Dr. Thomas Ludwig > Sitz der Gesellschaft: Hamburg > Amtsgericht Hamburg HRB 39784 > > -- > slurm-users mailing list -- slurm-users@lists.schedmd.com > To unsubscribe send an email to slurm-users-le...@lists.schedmd.com > > -- > Jason L. Simms, Ph.D., M.P.H. > Manager of Research Computing > Swarthmore College > Information Technology Services > (610) 328-8102 > Schedule a meeting: https://calendly.com/jlsimms -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: salloc not starting shell despite LaunchParameters=use_interactive_step
Paul Edmon via slurm-users writes: > Its definitely working for 23.11.8, which is what we are using. It turns out we had unintentionally started firewalld on the login node. Now this has been turned off, 'salloc' drops into a shell on a compute node as desired. Thanks for all the data points. Cheers, Loris > -Paul Edmon- > > On 9/5/24 10:22 AM, Loris Bennett via slurm-users wrote: >> Jason Simms via slurm-users writes: >> >>> Ours works fine, however, without the InteractiveStepOptions parameter. >> My assumption is also that default value should be OK. >> >> It would be nice if some one could confirm that 23.11.10 was working for >> them. However, we'll probably be upgrading to 24.5 fairly soon, and so >> we shall see whether the issue persists. >> >> Cheers, >> >> Loris >> >>> JLS >>> >>> On Thu, Sep 5, 2024 at 9:53 AM Carsten Beyer via slurm-users >>> wrote: >>> >>> Hi Loris, >>> >>> we use SLURM 23.02.7 (Production) and 23.11.1 (Testsystem). Our config >>> contains a second parameter InteractiveStepOptions in slurm.conf: >>> >>> InteractiveStepOptions="--interactive --preserve-env --pty $SHELL -l" >>> LaunchParameters=enable_nss_slurm,use_interactive_step >>> >>> That works fine for us: >>> >>> [k202068@levantetest ~]$ salloc -N1 -A k20200 -p compute >>> salloc: Pending job allocation 857 >>> salloc: job 857 queued and waiting for resources >>> salloc: job 857 has been allocated resources >>> salloc: Granted job allocation 857 >>> salloc: Waiting for resource configuration >>> salloc: Nodes lt1 are ready for job >>> [k202068@lt1 ~]$ >>> >>> Best Regards, >>> Carsten >>> >>> Am 05.09.24 um 14:17 schrieb Loris Bennett via slurm-users: >>> > Hi, >>> > >>> > With >>> > >>> >$ salloc --version >>> >slurm 23.11.10 >>> > >>> > and >>> > >>> >$ grep LaunchParameters /etc/slurm/slurm.conf >>> >LaunchParameters=use_interactive_step >>> > >>> > the following >>> > >>> >$ salloc --partition=interactive --ntasks=1 --time=00:03:00 >>> --mem=1000 --qos=standard >>> >salloc: Granted job allocation 18928869 >>> >salloc: Nodes c001 are ready for job >>> > >>> > creates a job >>> > >>> >$ squeue --me >>> > JOBID PARTITION NAME USER ST TIME NODES >>> NODELIST(REASON) >>> > 18928779 interacti interactloris R 1:05 1 >>> c001 >>> > >>> > but causes the terminal to block. >>> > >>> > From a second terminal I can log into the compute node: >>> > >>> >$ ssh c001 >>> >[13:39:36] loris@c001 (1000) ~ >>> > >>> > Is that the expected behaviour or should salloc return a shell directly >>> > on the compute node (like srun --pty /bin/bash -l used to do)? >>> > >>> > Cheers, >>> > >>> > Loris >>> > >>> -- >>> Carsten Beyer >>> Abteilung Systeme >>> >>> Deutsches Klimarechenzentrum GmbH (DKRZ) >>> Bundesstraße 45a * D-20146 Hamburg * Germany >>> >>> Phone: +49 40 460094-221 >>> Fax:+49 40 460094-270 >>> Email: be...@dkrz.de >>> URL:http://www.dkrz.de >>> >>> Geschäftsführer: Prof. Dr. Thomas Ludwig >>> Sitz der Gesellschaft: Hamburg >>> Amtsgericht Hamburg HRB 39784 >>> >>> -- >>> slurm-users mailing list -- slurm-users@lists.schedmd.com >>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com >>> >>> -- Jason L. Simms, Ph.D., M.P.H. >>> Manager of Research Computing >>> Swarthmore College >>> Information Technology Services >>> (610) 328-8102 >>> Schedule a meeting: https://calendly.com/jlsimms -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
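For anyone hitting the same symptom, a rough way to check for (and, assuming firewalld is in use, work around) a login-node firewall blocking the interactive step; the port range is only an example and would have to match an SrunPortRange setting in slurm.conf if you restrict it:

  # Is a firewall running on the login node?
  systemctl status firewalld

  # Either stop it ...
  systemctl stop firewalld && systemctl disable firewalld

  # ... or open a fixed port range for srun/salloc callbacks and set the
  # matching SrunPortRange=60001-63000 in slurm.conf (example values).
  firewall-cmd --permanent --add-port=60001-63000/tcp
  firewall-cmd --reload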
[slurm-users] Unable to receive password reminder
Hi, Over a week ago I sent the message below to the address I found for the list owner, but have not received a response. Does anyone know how to proceed in this case? Cheers, Loris Start of forwarded message From: Loris Bennett To: Subject: Unable to receive password reminder Date: Mon, 6 Jan 2025 08:35:42 +0100 Dear list owner, I have recently switched from reading the list via mail to using the mail-to-news gateway at news.gmane.io. Therefore I would like to change my mailman settings in order to stop delivery of postings via mail. As I have forgotten my list password, I requested a reminder. However, I get the reply that no user with the given email address was found in the user database. The addresses I tried were loris.benn...@fu-berlin.de lo...@zedat.fu-berlin.de the former being an alias for the latter. This is the email account to which emails from the list are sent, so I am somewhat confused as to why neither of the addresses is recognised. Could you please help me to resolve this issue? Regards Loris Bennett -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin End of forwarded message -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: Minimum cpu cores per node partition level configuration
Hi Tim, "Cutts, Tim via slurm-users" writes: > You can set a partition QoS which specifies a minimum. We have such a qos on > our large-gpu partition; we don’t want people scheduling small stuff to it, > so we > have this qos: How does this affect total throughput? Presumably, 'small' GPU jobs might have to wait for resources in other partitions, even if resources are free in 'large-gpu'. Do you have other policies which ameliorate this? Cheers, Loris [snip (135 lines)] -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
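Since Tim's actual QOS was snipped above, the following is only a generic illustration of the kind of partition QOS he describes, with invented names and values:

  # Sketch only -- not the snipped configuration:
  sacctmgr add qos big-only
  sacctmgr modify qos big-only set MinTRESPerJob=gres/gpu=4

  # attached to the partition in slurm.conf, e.g.:
  #   PartitionName=large-gpu Nodes=gpu[01-08] QOS=big-only ...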
[slurm-users] Re: How can we put limits on interactive jobs?
Hi Ole, Ole Holm Nielsen via slurm-users writes: > We would like to put limits on interactive jobs (started by salloc) so > that users don't leave unused interactive jobs behind on the cluster > by mistake. > > I can't offhand find any configurations that limit interactive jobs, > such as enforcing a timelimit. > > Perhaps this could be done in job_submit.lua, but I couldn't find any > job_desc parameters in the source code which would indicate if a job > is interactive or not. > > Question: How do people limit interactive jobs, or identify orphaned > jobs and kill them? We would be interested in this too. Currently we have a very make-shift solution which involves a script which simply pipes all running job IDs to 'sjeff' (https://github.com/ubccr/stubl/blob/master/bin/sjeff) every 30s. This produces an output like the following:

  Username  Mem_Request  Max_Mem_Use  CPU_Efficiency  Number_of_CPUs_In_Use
  able      3600M        0.94Gn       99.22%          (142.88 of 144)
  baker     8G           0.90Gn       0.60%           (0.02 of 4)
  charlie   varied       32.92Gn      42.54%          (5.96 of 14)
  ...
  == CPU efficiency: data above from Fri 25 Apr 11:17:09 CEST 2025 ==

where efficiencies under 50% are printed in red. As long as one only has about a screenful of users, it is fairly easy to spot users with a low CPU efficiency, whether it be due to idle interactive jobs or caused by something else. Apart from that, we have a partition called 'interactive' which has an appropriately short MaxTime. We don't actually lie to our users by saying that they have to use this partition, but we don't advertise the fact that they could use any of the other partitions for interactive work. This is obviously also even more make-shift :-) Cheers, Loris > Thanks a lot, > Ole > > -- > Ole Holm Nielsen > PhD, Senior HPC Officer > Department of Physics, Technical University of Denmark -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
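Two illustrative fragments for the make-shift measures described above (node names and limits invented): a short-MaxTime partition for interactive work, and a squeue call that lists salloc sessions, whose job name defaults to 'interactive':

  # slurm.conf: a partition with a deliberately short time limit
  PartitionName=interactive Nodes=c[001-016] MaxTime=08:00:00 DefaultTime=01:00:00 State=UP

  # list running salloc sessions and how long they have been up
  squeue --name=interactive --states=RUNNING -o "%.12i %.10u %.12M %R"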
[slurm-users] Re: Large memory jobs stuck Pending. Should use --time parameter?
Mike via slurm-users writes: > Greetings, > > We are new to Slurm and we are trying to better understand why we’re seeing > high-mem jobs stuck in Pending state indefinitely. Smaller (mem) jobs in the > queue will continue to pass by the high mem jobs even when we bump priority > on a pending high-mem job way up. We have been reading over the backfill > scheduling page and what we think we're seeing is that we need to require > that users specify a --time parameter on their jobs so that Backfill works > properly. > None of our users specify a --time param because we have never required it. > Is that what we need to require in order to fix this situation? From the > backfill > page: "Backfill scheduling is difficult without reasonable time limit > estimates for jobs, but some configuration parameters that can help" and it > goes on to list > some config params that we have not set (DefaultTime, MaxTime, > OverTimeLimit). We also see language such as, “Since the expected start time > of pending jobs > depends upon the expected completion time of running jobs, reasonably > accurate time limits are important for backfill scheduling to work well.” So > we > suspect that we can achieve proper backfill scheduling by requiring that all > users supply a "--time" parameter via a job submit plugin. Would that be a > fair > statement? You might also need to look at the configuration parameter SchedulerParameters, in particular bf_window. From 'man slurm.conf': bf_window=# The number of minutes into the future to look when considering jobs to schedule. Higher values result in more overhead and less responsiveness. A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation. In order to limit the amount of data managed by the backfill scheduler, if the value of bf_window is increased, then it is generally advisable to also increase bf_resolution. This option applies only to SchedulerType=sched/backfill. Default: 1440 (1 day), Min: 1, Max: 43200 (30 days). Regards Loris Bennett -- Dr. Loris Bennett (Herr/Mr) FUB-IT, Freie Universität Berlin -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
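A sketch of the job_submit.lua check Mike is considering; it assumes an unset time limit appears as slurm.NO_VAL, and whether to reject the job or inject a sensible default instead is a site policy choice:

  -- Sketch: refuse jobs submitted without --time so that backfill has
  -- usable estimates (alternatively, set a default instead of rejecting).
  function slurm_job_submit(job_desc, part_list, submit_uid)
     if job_desc.time_limit == slurm.NO_VAL then
        slurm.log_user("Please request a run time with --time")
        return slurm.ERROR
     end
     return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
     return slurm.SUCCESS
  end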