[slurm-users] Re: srun weirdness

2024-05-15 Thread Laura Hild via slurm-users
PropagateResourceLimitsExcept won't do it? From: Dj Merrill via slurm-users Sent: Wednesday, 15 May 2024 09:43 To: slurm-users@lists.schedmd.com Subject: [EXTERNAL] [slurm-users] Re: srun weirdness Thank you Hemann and Tom! That was it. The new cluster ha

[slurm-users] Re: Jobs showing running but not running

2024-05-29 Thread Laura Hild via slurm-users
> sudo systemctl restart slurmd # gets stuck Are you able to restart other services on this host? Anything weird in its dmesg? -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Node (anti?) Feature / attribute

2024-06-14 Thread Laura Hild via slurm-users
I wrote a job_submit.lua also. It would append "&centos79" to the feature string unless the features already contained "el9," or if empty, set the features string to "centos79" without the ampersand. I didn't hear from any users doing anything fancy enough with their feature string for the ampersa

[slurm-users] Re: Node (anti?) Feature / attribute

2024-06-17 Thread Laura Hild via slurm-users
> Could you post that snippet? function slurm_job_submit ( job_desc, part_list, submit_uid ) if job_desc.features then if not string.find(job_desc.features,"el9") then job_desc.features = job_desc.features .. '&centos79' end else job_desc.features = "centos79" end return slur
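The branching in the snippet above can be mirrored in Python for anyone who wants to check the string handling outside of Slurm. This is only an illustrative sketch: the real plugin is Lua, and the function name `default_os_feature` is hypothetical. `None` stands in for Lua's `nil`; as in Lua, an empty string still counts as "features were given".

```python
from typing import Optional


def default_os_feature(features: Optional[str]) -> str:
    """Mirror the job_submit.lua logic quoted above (hypothetical helper).

    None plays the role of Lua's nil (no feature string supplied).
    """
    if features is not None:
        # Plain substring test, like string.find(features, "el9") in the Lua.
        if "el9" not in features:
            return features + "&centos79"
        return features
    # No feature string at all: set it without the leading ampersand.
    return "centos79"


if __name__ == "__main__":
    for given in (None, "gpu", "el9", "el9&gpu"):
        print(repr(given), "->", default_os_feature(given))
```

Like the Lua, this does a naive substring check, so a feature whose name merely contains "el9" would also suppress the default.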

[slurm-users] Re: Job Step State

2024-07-12 Thread Laura Hild via slurm-users
There's an enum job_states in slurm.h. It becomes OUT_OF_MEMORY, &c. in the job_state_string function in slurm_protocol_defs.c.

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-01 Thread Laura Hild via slurm-users
Hi Henrique. Can you give an example of sharing being unavoidable?

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-01 Thread Laura Hild via slurm-users
So you're wanting that, instead of waiting for the task to finish and then running on the whole node, that the job should run immediately on n-1 CPUs? If there were only one CPU available in the entire cluster, would you want the job to start running immediately on one CPU instead of waiting fo

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-02 Thread Laura Hild via slurm-users
My read is that Henrique wants to specify a job to require a variable number of CPUs on one node, so that when the job is at the front of the queue, it will run opportunistically on however many happen to be available on a single node as long as there are at least five. I don't personally know

[slurm-users] Re: Best practices for tracking jobs started across multiple clusters for accounting purposes.

2024-08-30 Thread Laura Hild via slurm-users
Can whatever is running those sbatch commands add a --comment with a shared identifier that AccountingStoreFlags=job_comment would make available in sacct?
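A minimal sketch of that idea, assuming the submitting tool is a Python script: it generates one identifier and passes it to every `sbatch` call via `--comment`. The helper name `tagged_sbatch_argv`, the `campaign-` prefix, and the script names are hypothetical; the commands are only built and printed here, not executed.

```python
import uuid


def tagged_sbatch_argv(script: str, tag: str) -> list:
    """Build an sbatch command line carrying a shared --comment tag."""
    return ["sbatch", f"--comment={tag}", script]


# One identifier shared by every job in this (hypothetical) campaign.
tag = f"campaign-{uuid.uuid4().hex[:8]}"
commands = [tagged_sbatch_argv(s, tag) for s in ("job_a.sh", "job_b.sh")]

if __name__ == "__main__":
    for argv in commands:
        print(" ".join(argv))
```

With AccountingStoreFlags=job_comment set on each cluster, the same tag should then be searchable in each cluster's accounting records.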

[slurm-users] Re: Randomly draining nodes

2024-10-08 Thread Laura Hild via slurm-users
Apologies if I'm missing this in your post, but do you in fact have a Prolog configured in your slurm.conf?

[slurm-users] Re: Randomly draining nodes

2024-10-15 Thread Laura Hild via slurm-users
Your slurm.conf should be the same on all machines (is it? you don't have Prolog configured on some but not others?), but no, it is not mandatory to use a prolog. I am simply surprised that you could get a "Prolog error" without having a prolog configured, since an error in the prolog program

[slurm-users] Re: Dependency jobs

2024-10-16 Thread Laura Hild via slurm-users
> I know you can show job info and find what dependency a job is waiting > for, But more after checking if there are jobs waiting on the current > job to complete using the job ID, You mean you don't wanna like squeue -o%i,%E | grep SOME_JOBID ? Although I guess that won't catch a matching `s
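One thing a plain `grep` on `squeue -o%i,%E` output may do is match the job ID as a substring of a longer ID, or match the job's own ID column. A small sketch of a stricter filter follows; it assumes the output has already been captured as text, and the sample lines (including the exact dependency formatting) are hypothetical rather than taken from a real cluster.

```python
import re


def jobs_waiting_on(squeue_output: str, job_id: str) -> list:
    """Return IDs of jobs whose dependency field mentions job_id exactly."""
    # Require that job_id is not embedded in a longer run of digits.
    pattern = re.compile(rf"(?<!\d){re.escape(job_id)}(?!\d)")
    waiting = []
    for line in squeue_output.splitlines():
        line = line.strip()
        if "," not in line:
            continue
        # First field is the job ID, the remainder is the dependency string.
        jid, dependency = line.split(",", 1)
        if pattern.search(dependency):
            waiting.append(jid)
    return waiting


# Hypothetical captured output of: squeue -o%i,%E
SAMPLE = """JOBID,DEPENDENCY
1001,(null)
1002,afterok:1001(unfulfilled)
1003,afterany:10010(unfulfilled)
1004,afterok:1002(unfulfilled),afterok:1001(unfulfilled)
"""

if __name__ == "__main__":
    print(jobs_waiting_on(SAMPLE, "1001"))
```

This only covers dependencies that name a job ID explicitly.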

[slurm-users] Re: Why AllowAccounts not work in slurm-23.11.6

2024-10-19 Thread Laura Hild via slurm-users
What do you have for `sacctmgr list account`? If "root" is your top-level (Slurm) (bank) account, AllowAccounts=root may just end up meaning any account. To have AllowAccounts limit what users can submit, you'd need to name a lower-level Slurm (bank) account that only some users have an Associ

[slurm-users] Re: Why AllowAccounts not work in slurm-23.11.6

2024-10-30 Thread Laura Hild via slurm-users
If you run sshare or sacctmgr show association where parent=root (and so forth recursively where parent= each of the children) do you find that these other accounts that can submit to the partition are not ultimately sub-accounts of "root"? Quoting man 5 slurm.conf, concerning AllowAccou
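The recursive check described above amounts to walking each account's parent chain until it either reaches "root" or runs out. A sketch, assuming the account/parent pairs have already been collected (for example by hand from the `sacctmgr` output) into a dictionary; the account names and the helper `descends_from` are hypothetical.

```python
def descends_from(account: str, ancestor: str, parent_of: dict) -> bool:
    """Follow parent links upward from account, looking for ancestor."""
    seen = set()  # guards against a malformed, cyclic parent table
    current = account
    while current in parent_of and current not in seen:
        seen.add(current)
        current = parent_of[current]
        if current == ancestor:
            return True
    return False


# Hypothetical account tree: child -> parent.
PARENT_OF = {
    "physics": "root",
    "theory": "physics",
    "chem": "root",
    "guest": "external",
}

if __name__ == "__main__":
    for acct in ("theory", "chem", "guest"):
        print(acct, descends_from(acct, "root", PARENT_OF))
```

Any account for which this returns True is ultimately a sub-account of "root" in the table supplied.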

[slurm-users] loss of "unchangeable" node features

2024-10-23 Thread Laura Hild via slurm-users
Has anyone else noticed, somewhere between versions 22.05.11 and 23.11.9, losing fixed Features defined for a node in slurm.conf, and instead now just having those controlled by a NodeFeaturesPlugin like node_features/knl_generic?

[slurm-users] Re: jobs dropping

2024-11-12 Thread Laura Hild via slurm-users
Who is uid=64030? What is in the slurmctld log for the same timeframe? How does `sacct -j 1079` say the job ended?

[slurm-users] Re: Issue running slurm commands as normal account but work as root.

2025-02-05 Thread Laura Hild via slurm-users
Is your regular user unable to read the slurm.conf? How is the cluster set up to get the hostname of the Slurm controller?

[slurm-users] Re: slurmrestd equivalent to "srun -n 10 echo HELLO"

2025-03-24 Thread Laura Hild via slurm-users
> When I run something like `srun -n 10 echo HELLO', I get HELLO > returned to my console/stdout 10x. When I submit this command > as a script to the /jobs/submit route, I get success/200, but I > cannot determine how to get the console output of HELLO 10x in > any form. It's not in my stdout log

[slurm-users] Re: Using more cores/CPUs that requested with

2025-03-26 Thread Laura Hild via slurm-users
In addition to checking under /sys/fs/cgroup like Tim said, if this is just to convince yourself that you got the CPU restriction working, you could also open `top` on the host running the job and observe that %CPU is now being held to 200,0 or lower (or if it's multiple processes per job, summin

[slurm-users] Re: Error "_check_core_range_matches_sock" when starting "slurmctld"

2025-04-28 Thread Laura Hild via slurm-users
> Socket(s): 1 > NUMA node(s):1 > [...] > NodeName=mysystem Autodetect=off Name=gpu Type=geforce_gtx_titan_x > File=/dev/nvidia0 CPUs=0-1 > NodeName=mysystem Autodetect=off Name=gpu Type=geforce_gtx_titan_black > File=/dev/nvidia1 CPUs=2-3 What do you intend to achieve with CPU

[slurm-users] Re: Assistance with Burst Buffer Configuration in slurm

2025-03-12 Thread Laura Hild via slurm-users
Hi Manisha. Does your Slurm build/installation have burst_buffer_lua.so?

[slurm-users] Re: Doubt with SelectTypeParameters in slurm.conf

2025-03-28 Thread Laura Hild via slurm-users
> After running a simple “helloworld” test, I have noticed that if > SelectTypeParameters=CR_Core, system always reserves me an even > number of CPUs (during “pending” time, I can see the real number > I have requested, but when job starts “running”, that number is > increased to the next even numb