PropagateResourceLimitsExcept won't do it?
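For reference, that is a slurm.conf parameter; a typical setting (the value here is just an example, not from this thread) looks like:

PropagateResourceLimitsExcept=MEMLOCK

which keeps srun propagating the submitting shell's ulimits except for the ones listed.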
From: Dj Merrill via slurm-users
Sent: Wednesday, 15 May 2024 09:43
To: slurm-users@lists.schedmd.com
Subject: [EXTERNAL] [slurm-users] Re: srun weirdness
Thank you Hemann and Tom! That was it.
The new cluster ha
> sudo systemctl restart slurmd # gets stuck
Are you able to restart other services on this host? Anything weird in its
dmesg?
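A few generic things worth capturing when a unit hangs like that (nothing Slurm-specific, just the usual suspects):

systemctl status slurmd
journalctl -u slurmd --since "1 hour ago"
dmesg | tail -n 50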
I wrote a job_submit.lua also. It would append "&centos79" to the feature
string unless the features already contained "el9", or, if the feature string
was empty, set it to "centos79" without the ampersand. I didn't hear from any
users doing anything fancy enough with their feature string for the ampersand
to matter.
> Could you post that snippet?
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.features then
        -- user supplied features: also require centos79, unless they
        -- already asked for el9
        if not string.find(job_desc.features, "el9") then
            job_desc.features = job_desc.features .. '&centos79'
        end
    else
        -- no features given: default to centos79
        job_desc.features = "centos79"
    end
    return slurm.SUCCESS
end
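For anyone wanting to try something similar, the usual wiring (assuming the stock lua job-submit plugin; the path is just an example) is one line in slurm.conf:

JobSubmitPlugins=lua

with the script saved as job_submit.lua in the same directory as slurm.conf (e.g. /etc/slurm/job_submit.lua).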
There's an enum job_states in slurm.h. It becomes OUT_OF_MEMORY, &c. in the
job_state_string function in slurm_protocol_defs.c.
Hi Henrique. Can you give an example of sharing being unavoidable?
So you want the job, instead of waiting for the task to finish and then
running on the whole node, to run immediately on n-1 CPUs? If there were only
one CPU available in the entire cluster, would you want the job to start
running immediately on that one CPU instead of waiting for more to free up?
My read is that Henrique wants to specify a job to require a variable number of
CPUs on one node, so that when the job is at the front of the queue, it will
run opportunistically on however many happen to be available on a single node
as long as there are at least five.
I don't personally know
Can whatever is running those sbatch commands add a --comment with a shared
identifier that AccountingStoreFlags=job_comment would make available in sacct?
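Roughly what I have in mind (the comment value and filename below are made up):

sbatch --comment="nightly-batch-42" job.sh
# with AccountingStoreFlags=job_comment set in slurm.conf, then:
sacct -X --name=job.sh --format=JobID,JobName,Comment%30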
Apologies if I'm missing this in your post, but do you in fact have a Prolog
configured in your slurm.conf?
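For reference, a prolog would be enabled with a single line in slurm.conf (the path is just an example):

Prolog=/etc/slurm/prolog.sh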
Your slurm.conf should be the same on all machines (is it? you don't have
Prolog configured on some but not others?), but no, it is not mandatory to use
a prolog. I am simply surprised that you could get a "Prolog error" without
having a prolog configured, since an error in the prolog program
> I know you can show job info and find what dependency a job is waiting
> for, but I'm more after checking whether there are jobs waiting on the
> current job to complete, using the job ID.
You mean you don't wanna like
squeue -o%i,%E | grep SOME_JOBID
?
Although I guess that won't catch a matching `s
What do you have for `sacctmgr list account`? If "root" is your top-level
Slurm (bank) account, AllowAccounts=root may just end up meaning any account.
To have AllowAccounts limit which users can submit, you'd need to name a
lower-level Slurm (bank) account that only some users have an Association with.
If you run
sshare
or
sacctmgr show association where parent=root
(and so forth recursively where parent= each of the children) do you find that
these other accounts that can submit to the partition are not ultimately
sub-accounts of "root"?
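A quicker way to eyeball the whole hierarchy at once, if it helps (the column list is just a suggestion):

sacctmgr show association tree format=Account,ParentName,User,Partition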
Quoting man 5 slurm.conf, concerning AllowAccounts:
Has anyone else noticed, somewhere between versions 22.05.11 and 23.11.9,
losing fixed Features defined for a node in slurm.conf, and instead now just
having those controlled by a NodeFeaturesPlugin like node_features/knl_generic?
Who is uid=64030?
What is in the slurmctld log for the same timeframe?
How does `sacct -j 1079` say the job ended?
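If it helps, the uid can be resolved and the job record inspected with something like:

getent passwd 64030
sacct -j 1079 --format=JobID,State,ExitCode,Elapsed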
Is your regular user unable to read the slurm.conf? How is the cluster set up
to get the hostname of the Slurm controller?
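Two quick checks worth running as that regular user (assuming a conventional /etc/slurm layout; adjust the path for your site):

ls -l /etc/slurm/slurm.conf
scontrol show config | grep -i slurmctldhost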
> When I run something like `srun -n 10 echo HELLO', I get HELLO
> returned to my console/stdout 10x. When I submit this command
> as a script to the /jobs/submit route, I get success/200, but I
> cannot determine how to get the console output of HELLO 10x in
> any form. It's not in my stdout log
In addition to checking under /sys/fs/cgroup like Tim said, if this is just to
convince yourself that you got the CPU restriction working, you could also open
`top` on the host running the job and observe that %CPU is now being held to
200.0 or lower (or, if it's multiple processes per job, summing across them).
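If you'd rather read it straight from the cgroup, a rough recipe (cgroup v2 assumed; <PID> is any process of the job, and which files exist depends on how cgroup.conf constrains things):

cat /proc/<PID>/cgroup
# then, using the cgroup path printed above:
cat /sys/fs/cgroup/<path-from-above>/cpuset.cpus.effective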
> Socket(s): 1
> NUMA node(s): 1
> [...]
> NodeName=mysystem Autodetect=off Name=gpu Type=geforce_gtx_titan_x
> File=/dev/nvidia0 CPUs=0-1
> NodeName=mysystem Autodetect=off Name=gpu Type=geforce_gtx_titan_black
> File=/dev/nvidia1 CPUs=2-3
What do you intend to achieve with the CPUs= settings here?
Hi Manisha. Does your Slurm build/installation have burst_buffer_lua.so?
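One way to check on a typical packaged install (the lib path is an assumption; adjust for your distro):

ls /usr/lib64/slurm/burst_buffer_lua.so
scontrol show config | grep -i plugindir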
> After running a simple “helloworld” test, I have noticed that if
> SelectTypeParameters=CR_Core, system always reserves me an even
> number of CPUs (during “pending” time, I can see the real number
> I have requested, but when job starts “running”, that number is
> increased to the next even number.