[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm

2025-04-15 Thread Christopher Samuel via slurm-users
Hiya, On 4/15/25 7:03 pm, lyz--- via slurm-users wrote: Hi, Chris. Thank you for continuing to pay attention to this issue. I followed your instruction. This is the output: [root@head1 ~]# systemctl cat slurmd | fgrep Delegate Delegate=yes That looks good to me, thanks for sharing that!
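For context, on systemd-based distros that Delegate setting typically comes from the slurmd unit file or a drop-in override. A minimal sketch of such a drop-in (the file path is an assumption) might be:

```
# /etc/systemd/system/slurmd.service.d/delegate.conf (assumed path)
# Delegate=yes hands cgroup management of slurmd's subtree to slurmd
# itself, which Slurm's cgroup plugins rely on.
[Service]
Delegate=yes
```

After adding a drop-in like this, run `systemctl daemon-reload` and restart slurmd for it to take effect.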

[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm

2025-04-15 Thread Christopher Samuel via slurm-users
On 4/15/25 6:57 pm, lyz--- via slurm-users wrote: Hi, Sean. It's the latest slurm version. [root@head1 ~]# sinfo --version slurm 22.05.3 That's quite old (and no longer supported); the oldest still-supported version is 23.11.10, and 24.11.4 came out recently. What does the cgroup.conf file o
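The cgroup.conf being asked about here usually needs device constraint enabled for GPU limits to actually be enforced; an illustrative file (not the poster's actual one) might look like:

```
# cgroup.conf — illustrative example, not the poster's actual file
ConstrainCores=yes
ConstrainRAMSpace=yes
# Without ConstrainDevices=yes, jobs can open GPU devices they were
# never allocated, regardless of any gres/TRES limits.
ConstrainDevices=yes
```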

[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm

2025-04-15 Thread Christopher Samuel via slurm-users
On 4/15/25 12:55 pm, Sean Crosby via slurm-users wrote: What version of Slurm are you running and what's the contents of your gres.conf file? Also what does this say? systemctl cat slurmd | fgrep Delegate -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA -- slurm-users maili
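For reference, the gres.conf being requested might look like this on a hypothetical 4-GPU node (node name, GPU type, and device paths are assumptions):

```
# gres.conf — hypothetical 4-GPU node
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia[0-3]
```

The matching slurm.conf node definition would then carry a corresponding `Gres=gpu:a100:4` entry.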

[slurm-users] Re: Issue with Enforcing GPU Usage Limits in Slurm

2025-04-14 Thread Christopher Samuel via slurm-users
On 4/14/25 6:27 am, lyz--- via slurm-users wrote: This command is intended to limit user 'lyz' to using a maximum of 2 GPUs. However, when the user submits a job using srun, specifying CUDA 0, 1, 2, and 3 in the job script, or os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3", the job still utili

[slurm-users] Re: errors while trying to setup slurmdbd.

2025-04-09 Thread Christopher Samuel via slurm-users
Hi Steven, On 4/9/25 5:00 pm, Steven Jones via slurm-users wrote: Apr 10 10:28:52 vuwunicohpcdbp1.ods.vuw.ac.nz slurmdbd[2413]: slurmdbd: fatal: This host not configured to run SlurmDBD ((vuwunicohpcdbp1 or vuwunicohp> ^^^ that's the critical error message, and it's reporting that because s

[slurm-users] Re: mariadb refusing access

2025-03-04 Thread Christopher Samuel via slurm-users
On 3/4/25 5:23 pm, Steven Jones via slurm-users wrote: However mysql -u slurm -p works just fine so it seems to be a config error for slurmdbd Try: mysql -h 127.0.0.1 -u slurm -p IIRC without that it'll try a UNIX domain socket rather than connecting via TCP/IP.
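The same TCP-vs-socket distinction can be made explicit on the slurmdbd side; a hedged slurmdbd.conf fragment (values illustrative) might be:

```
# slurmdbd.conf (fragment) — StorageHost=127.0.0.1 forces a TCP
# connection, matching the `mysql -h 127.0.0.1` test above
StorageType=accounting_storage/mysql
StorageHost=127.0.0.1
StoragePort=3306
StorageUser=slurm
```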

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread Christopher Samuel via slurm-users
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote: I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An a

[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-03 Thread Christopher Samuel via slurm-users
On 2/3/25 2:33 pm, Steven Jones via slurm-users wrote: Just built 4 x rocky9 nodes and I do not get that error (but I get another I know how to fix, I think) so holistically  I am thinking the version difference is too large. Oh I think I missed this - when you say version difference do you m

[slurm-users] Re: sinfo not listing any partitions

2024-11-27 Thread Christopher Samuel via slurm-users
On 11/27/24 11:38 am, Kent L. Hanson via slurm-users wrote: I have restarted the slurmctld and slurmd services several times. I hashed the slurm.conf files. They are the same. I ran “sinfo -a” as root with the same result. Are your nodes in the `FUTURE` state perhaps? What does this show? si
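A node parked in the `FUTURE` state is defined like any other node but ignored by the controller until its state is updated; a sketch (hardware values are assumptions):

```
# slurm.conf (fragment) — FUTURE nodes are ignored by slurmctld and
# hidden from normal sinfo output until brought into service
NodeName=node[101-110] CPUs=64 RealMemory=256000 State=FUTURE
```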

[slurm-users] Re: Job pre / post submit scripts

2024-10-28 Thread Christopher Samuel via slurm-users
On 10/28/24 10:56 am, Bhaskar Chakraborty via slurm-users wrote: Is there an option in slurm to launch a custom script at the time of job submission through sbatch or salloc? The script should run with submit user permission in submit area. I think you are after the cli_filter functionality w
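The cli_filter hooks run client-side, in the submitting user's own environment, which matches the requirement above. A minimal sketch, assuming `CliFilterPlugins=cli_filter/lua` is set in slurm.conf (the function name is the one the lua plugin looks for):

```lua
-- cli_filter.lua (sketch): executed by sbatch/salloc/srun in the
-- user's own session before the job is submitted
function slurm_cli_pre_submit(options, pack_offset)
    -- inspect or adjust submit options here
    return slurm.SUCCESS
end
```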

[slurm-users] Re: Randomly draining nodes

2024-10-24 Thread Christopher Samuel via slurm-users
Hi Ole, On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote: Some time ago it was recommended that UnkillableStepTimeout values above 127 (or 256?) should not be used, see https://support.schedmd.com/ show_bug.cgi?id=11103.  I don't know if this restriction is still valid with recent

[slurm-users] Re: Randomly draining nodes

2024-10-21 Thread Christopher Samuel via slurm-users
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote: It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this? That usually means processes wedged in the kernel for some reason, in an uninterruptible sleep state. You can define
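The knobs referred to here live in slurm.conf; a hedged fragment (the script path and timeout value are illustrative):

```
# slurm.conf (fragment) — run a diagnostic script on the compute node
# when a job step survives SIGKILL for longer than the timeout
UnkillableStepProgram=/usr/local/sbin/unkillable.sh
UnkillableStepTimeout=120
```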

[slurm-users] Re: REST API - get_user_environment

2024-08-15 Thread Christopher Samuel via slurm-users
On 8/15/24 7:04 am, jpuerto--- via slurm-users wrote: I am referring to the REST API. We have had it installed for a few years and have recently upgraded it so that we can use v0.0.40. But this most recent version is missing the "get_user_environment" field which existed in previous versions.

[slurm-users] Re: Upgrade node while jobs running

2024-08-02 Thread Christopher Samuel via slurm-users
G'day Sid, On 7/31/24 5:02 pm, Sid Young via slurm-users wrote: I've been waiting for node to become idle before upgrading them however some jobs take a long time. If I try to remove all the packages I assume that kills the slurmstep program and with it the job. Are you looking to do a Slurm

[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-21 Thread Christopher Samuel via slurm-users
On 6/21/24 3:50 am, Arnuld via slurm-users wrote: I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job with just GPU without any CPUs? No, Slurm has to launch the batch script on compute node cores and it then has the job of launching the users

[slurm-users] Re: Unsupported RPC version by slurmctld 19.05.3 from client slurmd 22.05.11

2024-06-17 Thread Christopher Samuel via slurm-users
On 6/17/24 7:24 am, Bjørn-Helge Mevik via slurm-users wrote: Also, server must be newer than client. This is the major issue for the OP - the version rule is: slurmdbd >= slurmctld >= slurmd and clients and no more than the permitted skew in versions. Plus, of course, you have to deal with

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-23 Thread Christopher Samuel via slurm-users
On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote: A simple example is when you have nodes with and without GPUs. You can build slurmd packages without for those nodes and with for the ones that have them. FWIW we have both GPU and non-GPU nodes but we use the same RPMs we build on both

[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Christopher Samuel via slurm-users
Hi Jeff! On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote: I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu packages. I now want to install pyxis but it says I need the Slurm sources. In Ubuntu 22.04, is there a package that has the source code? How to download t

[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-06 Thread Christopher Samuel via slurm-users
On 5/6/24 3:19 pm, Nuno Teixeira via slurm-users wrote: Fixed with: [...] Thanks and sorry for the noise as I really missed this detail :) So glad it helped! Best of luck with this work.

[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-06 Thread Christopher Samuel via slurm-users
On 5/6/24 6:38 am, Nuno Teixeira via slurm-users wrote: Any clues about "elf_aarch64" and "aarch64elf" mismatch? As I mentioned I think this is coming from the FreeBSD patching that's being done to the upstream Slurm sources, specifically it looks like elf_aarch64 is being injected here: /

[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-04 Thread Christopher Samuel via slurm-users
On 5/4/24 4:24 am, Nuno Teixeira via slurm-users wrote: Any clues? > ld: error: unknown emulation: elf_aarch64 All I can think is that your ld doesn't like elf_aarch64; from the log you're posting it looks like that's being injected from the FreeBSD ports system. Looking at the man page for ld on

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Christopher Samuel via slurm-users
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote: In our case, that node has been removed from the cluster and cannot be added back right now ( is being used for some other work ). What can we do in such a case? Mark the node as "DOWN" in Slurm, this is what we do when we get job

[slurm-users] Re: Is SWAP memory mandatory for SLURM

2024-03-04 Thread Christopher Samuel via slurm-users
On 3/3/24 23:04, John Joseph via slurm-users wrote: Is SWAP a mandatory requirement? All our compute nodes are diskless, so no swap on them.

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-23 Thread Christopher Samuel via slurm-users
Hi Robert, On 2/23/24 17:38, Robert Kudyba via slurm-users wrote: We switched over from using systemctl for tmp.mount and change to zram, e.g., modprobe zram echo 20GB > /sys/block/zram0/disksize mkfs.xfs /dev/zram0 mount -o discard /dev/zram0 /tmp [...] > [2024-02-23T20:26:15.881] [530.exter

Re: [slurm-users] sacct --name --status filtering

2024-01-10 Thread Christopher Samuel
On 1/10/24 19:39, Drucker, Daniel wrote: What am I misunderstanding about how sacct filtering works here? I would have expected the second command to show the exact same results as the first. You need to specify --end NOW for this to work as expected. From the man page: WITHOUT --jobs AN

Re: [slurm-users] parastation (mpi)

2023-11-24 Thread Christopher Samuel
On 11/24/23 06:16, Heckes, Frank wrote: My colleagues are using this toolchains on Jülich cluster (especially Juwels). My question is whether these eb files can be shared ? I would be interested especially in the ones using NVHPC as core module. If Jülich developed that toolchain then I think

Re: [slurm-users] SLURM , maximum scalable instance is which one

2023-11-06 Thread Christopher Samuel
On 10/29/23 03:13, John Joseph wrote: I'd like to know the maximum scaled-up instance of SLURM so far. Cori (which we retired mid-year) had ~12,000 compute nodes in case that helps.

Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-24 Thread Christopher Samuel
On 10/24/23 12:39, Tim Schneider wrote: Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME ", the node goes in "mix@" state (not drain), but no new jobs get scheduled until the node reboots. Essentially I get draining behavior, even though the node's state is not "drain". N

Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-16 Thread Christopher Samuel
On 10/16/23 08:22, Groner, Rob wrote: It is my understanding that it is a different issue than pmix. That's my understanding too. The PMIx issue wasn't in Slurm, it was in the PMIx code that Slurm was linked to. This CVE is for Slurm itself.

Re: [slurm-users] Fairshare: Penalising unused memory rather than used memory?

2023-10-15 Thread Christopher Samuel
On 10/11/23 07:27, Cristian Huza wrote: I recall there was a built in tool named seff (slurm efficiency), not sure if it is still maintained "seff" is in the Slurm sources in the contribs/seff directory, if you're building RPMs from them then it's in the "slurm-contribs" RPM.

Re: [slurm-users] Site factor plugin example?

2023-10-15 Thread Christopher Samuel
On 10/13/23 10:10, Angel de Vicente wrote: But, in any case, I would still be interested in a site factor plugin example, because I might revisit this in the future. I don't know if you saw, but there is a skeleton example in the Slurm sources: src/plugins/site_factor/none Not sure if that

Re: [slurm-users] Unconfigured GPUs being allocated

2023-08-02 Thread Christopher Samuel
On 7/14/23 1:10 pm, Wilson, Steven M wrote: It's not so much whether a job may or may not access the GPU but rather which GPU(s) is(are) included in $CUDA_VISIBLE_DEVICES. That is what controls what our CUDA jobs can see and therefore use (within any cgroups constraints, of course). In my case

Re: [slurm-users] slurmdbd database usage

2023-08-02 Thread Christopher Samuel
On 8/2/23 2:30 pm, Sandor wrote: I am looking to track accounting and job data. Slurm requires the use of MySQL or MariaDB. Has anyone created the needed tables within PostGreSQL then had slurmdbd write to it? Any problems? From memory (and confirmed by git) support for Postgres was removed

Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-14 Thread Christopher Samuel
On 7/14/23 10:20 am, Wilson, Steven M wrote: I upgraded Slurm to 23.02.3 but I'm still running into the same problem. Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still being made available to jobs so we end up with compute jobs being run on GPUs which should only be used

Re: [slurm-users] Trying to update from slurm 19.05 to slurm 23.02 but I can't figure out how to allow users to reboot nodes...

2023-06-06 Thread Christopher Samuel
On 6/6/23 1:33 pm, Heinz, Michael wrote: I've gone through the man pages for slurm.conf but I can't find anything about how to define who the admins are? Is there still a way to do this with slurm or has the ability been removed? Looks like that was disabled over 3 years ago. commit dd111a5

Re: [slurm-users] Temporary Stop User Submission

2023-05-26 Thread Christopher Samuel
On 5/25/23 4:16 pm, Markuske, William wrote: I have a badly behaving user that I need to speak with and want to temporarily disable their ability to submit jobs. I know I can change their account settings to stop them. Is there another way to set a block on a specific username that I can lift

Re: [slurm-users] Usage gathering for GPUs

2023-05-24 Thread Christopher Samuel
On 5/24/23 11:39 am, Fulton, Ben wrote: Hi, Hi Ben, The release notes for 23.02 say “Added usage gathering for gpu/nvml (Nvidia) and gpu/rsmi (AMD) plugins”. How would I go about enabling this? I can only comment on the nvidia side (as those are the GPUs we have) but for that you need S

Re: [slurm-users] [EXTERNAL] Re: Question about PMIX ERROR messages being emitted by some child of srun process

2023-05-23 Thread Christopher Samuel
On 5/23/23 10:33 am, Pritchard Jr., Howard wrote: Thanks Christopher, No worries! This doesn't seem to be related to Open MPI at all except that for our 5.0.0 and newer one has to use PMix to talk to the job launcher. I built MPICH 4.1 on Perlmutter using the --with-pmix option and see a si

Re: [slurm-users] Question about PMIX ERROR messages being emitted by some child of srun process

2023-05-22 Thread Christopher Samuel
Hi Tommi, Howard, On 5/22/23 12:16 am, Tommi Tervo wrote: 23.02.2 contains a PMIx permission regression; it may be worth checking whether that's the case. I confirmed I could replicate the UNPACK-INADEQUATE-SPACE messages Howard is seeing on a test system, so I tried that patch on that same system with

Re: [slurm-users] From an initial installation cannot start slurmctld with a slurmdbd running

2023-05-17 Thread Christopher Samuel
Hi Lawrence, On 5/17/23 3:26 pm, Sorrillo, Lawrence wrote: Here is the error I get: slurmctld: fatal: Can not recover assoc_usage state, incompatible version, got 9728 need >= 8704 <= 9216, The slurm version is:  20.11.9 That error seems to appear when slurmctld is loading usage data from

Re: [slurm-users] PreemptExemptTime

2023-03-07 Thread Christopher Samuel
On 3/7/23 6:46 am, Groner, Rob wrote: Our global settings are PreemptMode=SUSPEND,GANG and PreemptType=preempt/partition_prio.  We have a high priority partition that nothing should ever preempt, and an open partition that is always preemptable.  In between is a burst partition.  It can be pr

Re: [slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results

2023-02-18 Thread Christopher Samuel
On 2/10/23 11:06 am, Analabha Roy wrote: I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in my cluster. If you're looking to try checkpointing MPI applications you may want to experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin for DMTCP here: https:/

Re: [slurm-users] Slurm - UnkillableStepProgram

2023-01-19 Thread Christopher Samuel
On 1/19/23 5:01 am, Stefan Staeglich wrote: Hi, Hiya, I'm wondering where the UnkillableStepProgram is actually executed. According to Mike it has to be available on every one of the compute nodes. This makes sense only if it is executed there. That's right, it's only executed on compute nodes

Re: [slurm-users] Interactive jobs using "srun --pty bash" and MPI

2022-11-02 Thread Christopher Samuel
On 11/2/22 4:45 pm, Juergen Salk wrote: However, instead of using `srun --pty bash' for launching interactive jobs, it is now recommended to use `salloc' and have `LaunchParameters=use_interactive_step' set in slurm.conf. +1 on that, this is what we've been using since it landed.

Re: [slurm-users] Prolog and job_submit

2022-10-31 Thread Christopher Samuel
On 10/31/22 5:46 am, Davide DelVento wrote: Thanks for helping me find workarounds. No worries! My only other thought is that you might be able to use node features & job constraints to communicate this without the user realising. I am not sure I understand this approach. I was just tryi

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Christopher Samuel
On 8/3/22 11:47 am, Benjamin Arntzen wrote: At risk of being a heretic, why not something like Ansible to handle this? Nothing heretical about that, but for us the reason is that `scontrol reboot ASAP` is integrated nicely into the scheduling of jobs, we have health checks and node epilogs t

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Christopher Samuel
On 8/3/22 8:37 am, Phil Chiu wrote: Therefore my problem is this - "Reboot all nodes, permitting N nodes to be rebooting simultaneously." I think currently the only way to do that would be to have a script that does: * issue the `scontrol reboot ASAP nextstate=resume [...]` for 3 nodes * wa

Re: [slurm-users] Rate-limiting sbatch and srun

2022-07-19 Thread Christopher Samuel
On 7/18/22 3:45 pm, gphipps wrote: Every so often one of our users accidentally writes a “fork-bomb” that submits thousands of sbatch and srun requests per second. It is a giant DDOS attack on our scheduler. Is there a way of rate limiting these requests before they reach the daemon? Yes

Re: [slurm-users] How do you make --export=NONE the default behavior for our cluster?

2022-06-04 Thread Christopher Samuel
On 6/3/22 11:39 am, Ransom, Geoffrey M. wrote: Adding “--export=NONE” to the job avoids the problem, but I’m not seeing a way to change this default behavior for the whole cluster. There's an SBATCH_EXPORT environment variable that you could set for users to force that (at $JOB-1 we used to d
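One way to apply the SBATCH_EXPORT approach cluster-wide is a login-shell profile script; a minimal sketch (the profile.d path is an assumption, and any shell-init mechanism would do):

```shell
# Sketch: make --export=NONE the default for sbatch cluster-wide by
# exporting SBATCH_EXPORT from a profile script, e.g. a file like
# /etc/profile.d/slurm_export.sh containing just this line:
export SBATCH_EXPORT=NONE
```

Users can still override the default per job with an explicit `--export=...` on the command line.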

Re: [slurm-users] Rolling upgrade of compute nodes

2022-05-29 Thread Christopher Samuel
On 5/29/22 3:09 pm, byron wrote:  This is the first time I've done an upgrade of slurm and I had been hoping to do a rolling upgrade as opposed to waiting for all the jobs to finish on all the compute nodes and then switching across but I dont see how I can do it with this setup.  Does any on

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Christopher Samuel
On 5/17/22 12:00 pm, Paul Edmon wrote: Database upgrades can also take a while if your database is large. Definitely recommend backing up prior to upgrade as well as running slurmdbd -Dv and not the systemd daemon as if the upgrade takes a long time it will kill it preemptively due to unre

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Christopher Samuel
On 5/5/22 7:08 am, Mark Dixon wrote: I'm confused how this is supposed to be achieved in a configless setting, as slurmctld isn't running to distribute the updated files to slurmd. That's exactly what happens with configless mode, slurmd's retrieve their config from the slurmctld, and will g

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Christopher Samuel
On 5/5/22 5:17 am, Steven Varga wrote: Thank you for the quick reply! I know I am pushing my luck here: is it possible to modify slurm: src/common/[read_conf.c, node_conf.c] src/slurmctld/[read_config.c, ...] such that the state can be maintained dynamically? -- or cheaper to write a job manag

Re: [slurm-users] SLURM: reconfig

2022-05-04 Thread Christopher Samuel
On 5/4/22 7:26 pm, Steven Varga wrote: I am wondering what is the best way to update node changes, such as addition and removal of nodes to SLURM. The excerpts below suggest a full restart, can someone confirm this? You are correct, you need to restart slurmctld and slurmd daemons at present

Re: [slurm-users] sbatch - accept jobs above limits

2022-02-09 Thread Christopher Samuel
On 2/8/22 11:41 pm, Alexander Block wrote: I'm just discussing a familiar case with SchedMD right now (ticket 13309). But it seems that it is not possible with Slurm to submit jobs that request features/configuration that are not available at the moment of submission. Does --hold not allow t

Re: [slurm-users] sbatch - accept jobs above limits

2022-02-08 Thread Christopher Samuel
On 2/8/22 2:26 pm, z1...@arcor.de wrote: These jobs should be accepted, if a suitable node will be active soon. For example, these jobs could be in PartitionConfig. From memory if you submit jobs with the `--hold` option then you should find they are successfully accepted - I've used that in

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Christopher Samuel
On 1/31/22 9:25 pm, Brian Andrus wrote: touch /etc/nologin That will prevent new logins. It's also useful that if you put a message in /etc/nologin then users who are trying to login will get that message before being denied. All the best, Chris
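Since the file's contents are shown to users before the login is refused, a short status note works well; an illustrative /etc/nologin message (wording is an assumption):

```
Login node down briefly for hardware maintenance.
Running batch jobs are not affected. Expected back by 15:00 UTC.
```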

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Christopher Samuel
On 1/31/22 9:00 pm, Christopher Samuel wrote: That would basically be the way. Thinking further on this, a better way would be to mark your partitions down, as it's likely you've got fewer partitions than compute nodes. All the best, Chris

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Christopher Samuel
On 1/31/22 4:41 pm, Sid Young wrote: I need to replace a faulty DIMM chip in our login node so I need to stop new jobs being kicked off while letting the old ones end. I thought I would just set all nodes to drain to stop new jobs from being kicked off... That would basically be the way, bu

Re: [slurm-users] Questions about scontrol reconfigure / reconfig

2022-01-16 Thread Christopher Samuel
On 1/16/22 7:41 pm, Nicolas Greneche wrote: I add a new compute node in config file so, Nodename becomes : When adding a node you need to restart slurmctld and all the slurmd's as they (currently) can only rebuild their internal structures for this at that time. This is meant to be addressed

Re: [slurm-users] Error " slurm_receive_msg_and_forward: Zero Bytes were transmitted or received"

2021-12-01 Thread Christopher Samuel
On 12/1/21 5:51 am, Gestió Servidors wrote: I can’t syncronize before with “ntpdate” because when I run “ntpdate -s my_NTP_server”, I only received message “ntpdate: no server suitable for synchronization found”… Yeah, you'll need to make sure your NTP infrastructure is working first. There

Re: [slurm-users] random allocation of resources

2021-12-01 Thread Christopher Samuel
On 12/1/21 3:27 pm, Brian Andrus wrote: If you truly want something like this, you could have a wrapper script look at available nodes, pick a random one and set the job to use that node. Alternatively you could have a cron job that adjusted nodes `weight` periodically to change which ones S

Re: [slurm-users] Job Preemption Time

2021-11-22 Thread Christopher Samuel
On 11/22/21 8:28 pm, Jeherul Islam wrote: Is there any way to configure slurm, that the High Priority job waits for a certain amount of time(say 24 hours), before it preempts the other job? Not quite, but you can set PreemptExemptTime which says how long a job must have run for before it can
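PreemptExemptTime is a slurm.conf setting; a fragment matching the 24-hour example above (the mode and type lines are illustrative):

```
# slurm.conf (fragment) — a job becomes eligible for preemption only
# after it has run for PreemptExemptTime
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PreemptExemptTime=24:00:00
```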

Re: [slurm-users] Can't use cgroups on debian 11 : unable to get parameter 'tasks' for '/sys/fs/cgroup/cpuset/'

2021-11-16 Thread Christopher Samuel
On 11/16/21 8:04 am, Arthur Toussaint wrote: I've seen people having those kind of problems, but no one seem to be able to solve it and keep the cgroups Debian Bullseye switched to cgroups v2 by default which Slurm doesn't support yet, you'll need to switch back to the v1 cgroups. The release

Re: [slurm-users] Unable to start slurmd service

2021-11-16 Thread Christopher Samuel
On 11/16/21 7:07 am, Jaep Emmanuel wrote: > root@ecpsc10:~# scontrol show node ecpsc10 [...] >State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A [...]    Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04] This is why the node isn't considered available, as o

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-08 Thread Christopher Samuel
On 8/7/21 11:47 pm, Adrian Sevcenco wrote: yes, the jobs that are running have a part of file saving if they are killed, saving which depending of the target can get stuck ... i have to think for a way to take a processes snapshot when this happens .. Slurm does let you request a signal a cer

Re: [slurm-users] Users Logout when job die or complete

2021-07-10 Thread Christopher Samuel
Hi Andrea, On 7/9/21 3:50 am, Andrea Carotti wrote: ProctrackType=proctrack/pgid I suspect this is the cause of your problems, my bet is that it is incorrectly identifying the users login processes as being part of the job and thinking it needs to tidy them up in addition to any processes
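The usual fix for pgid-based tracking sweeping up login shells is cgroup-based tracking; a hedged slurm.conf fragment:

```
# slurm.conf (fragment) — track job processes by cgroup membership
# rather than process group, so login shells are never mistaken for
# job processes
ProctrackType=proctrack/cgroup
```

This requires a working cgroup.conf and a restart of the slurm daemons to change.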

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Christopher Samuel
On 7/1/21 7:08 am, Brian Andrus wrote: I have a partition where one of the nodes has a node-locked license. That license is not used by everyone that uses the partition. This might be a case for using a reservation on that node with the MaxStartDelay flag to set the maximum amount of time (in

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-07-01 Thread Christopher Samuel
On 7/1/21 3:26 pm, Sid Young wrote: I have exactly the same issue with a user who needs the reported cores to reflect the requested cores. If you find a solution that works please share. :) The number of CPUs in the system vs the number of CPUs you can access are very different things. You c

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Christopher Samuel
On 6/4/21 11:04 am, Ahmad Khalifa wrote: Because there are failing GPUs that I'm trying to avoid. Could you remove them from your gres.conf and adjust slurm.conf to match? If you're using cgroups enforcement for devices (ConstrainDevices=yes in cgroup.conf) then that should render them inacc

Re: [slurm-users] DMTCP or MANA with Slurm?

2021-05-28 Thread Christopher Samuel
On 5/27/21 12:26 pm, Prentice Bisbal wrote: Given the lack of traffic on the mailing list and lack of releases, I'm beginning to think that both of these project are all but abandoned. They're definitely actively working on it - I've given them a heads up on this to let them know how it's bei

Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Christopher Samuel
On 5/24/21 3:02 am, Mark Dixon wrote: Does anyone have advice on automatically draining a node in this situation, please? We do some health checks via a node epilog set with the "Epilog" setting, including queueing node reboots with "scontrol reboot". All the best, Chris

Re: [slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch

2021-05-20 Thread Christopher Samuel
On 5/19/21 1:41 pm, Tim Carlson wrote: but I still don't understand how with "shared=exclusive" srun gives one result and sbatch gives another. I can't either, but I can reproduce it with Slurm 20.11.7. :-/

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread Christopher Samuel
On 5/19/21 9:15 pm, Herc Silverstein wrote: Does anyone have an idea of what might be going on? To add to the other suggestions, I would say that checking the slurmctld and slurmd logs to see what it is saying is wrong is a good place to start. Best of luck, Chris

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Christopher Samuel
On 5/14/21 1:45 am, Diego Zuccato wrote: [quoted sreport output: cluster utilization reported as a percentage of total, with columns Cluster, TRES Name, Allocated, Down, PLND Down, Idle, Reserved, Reported]

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Christopher Samuel
On 5/14/21 1:45 am, Diego Zuccato wrote: It just doesn't recognize 'ALL'. It works if I specify the resources. That's odd, what does this say? sreport --version All the best, Chris

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-13 Thread Christopher Samuel
On 5/13/21 3:08 pm, Sid Young wrote: Hi All, Hiya, Is there a way to define an effective "usage rate" of a HPC Cluster using the data captured in the slurm database. Primarily I want to see if it can be helpful in presenting to the business a case for buying more hardware for the HPC  :)

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-19 Thread Christopher Samuel
Hi Robert, On 4/16/21 12:39 pm, Robert Peck wrote: Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs because ONE (or a few) node(s) fails? You will also probably want this for your srun: --kill-on-bad-exit=0 What does the scontrol command below show? scontrol show

Re: [slurm-users] PartitionName default

2021-04-07 Thread Christopher Samuel
On 4/7/21 11:48 am, Administração de Sistemas do Centro de Bioinformática wrote: Unfortunately, I still don't know how to use any other value to PartitionName. We've got about 20 different partitions on our large Cray system, with a variety of names (our submit filter system directs jobs to

Re: [slurm-users] Rate Limiting of RPC calls

2021-02-09 Thread Christopher Samuel
On 2/9/21 5:08 pm, Paul Edmon wrote: 1. Being on the latest release: A lot of work has gone into improving RPC throughput, if you aren't running the latest 20.11 release I highly recommend upgrading.  20.02 also was pretty good at this. We've not gone to 20.11 on production systems yet, but I

Re: [slurm-users] only 1 job running

2021-01-28 Thread Christopher Samuel
On 1/27/21 9:28 pm, Chandler wrote: Hi list, we have a new cluster setup with Bright cluster manager. Looking into a support contract there, but trying to get community support in the mean time.  I'm sure things were working when the cluster was delivered, but I provisioned an additional node

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-26 Thread Christopher Samuel
On 1/24/21 8:39 am, Paul Raines wrote: I think you have identified the issue here or are very close.  My gres.conf on the rtx-04 node for example is: AutoDetect=nvml Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0 Cores=0-15 [...] Ah - you are doing both autodiscovery here and also specifying

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Christopher Samuel
On 1/26/21 12:10 pm, Ole Holm Nielsen wrote: What I don't understand is, is it actually *required* to make the NVIDIA libraries available to Slurm?  I didn't do that, and I'm not aware of any problems with our GPU nodes so far.  Of course, our GPU nodes have the libraries installed and the /de

Re: [slurm-users] Defining an empty partition

2021-01-05 Thread Christopher Samuel
On 12/18/20 4:45 am, Tina Friedrich wrote: Yeah, I had that problem as well (trying to set up a partition that didn't have any nodes - they're not here yet). You can define nodes in Slurm that don't exist yet with State=FUTURE, that means slurmctld basically ignores them until you change that
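A minimal sketch of what that looks like in slurm.conf (node names and hardware figures here are invented):

```ini
# Nodes that are ordered but not yet delivered: slurmctld parses them
# but ignores them until the State is changed (e.g. to IDLE).
NodeName=node[101-116] CPUs=64 RealMemory=256000 State=FUTURE
```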

Re: [slurm-users] Scripts run slower in slurm?

2020-12-15 Thread Christopher Samuel
On 12/14/20 11:20 pm, Alpha Experiment wrote: It is called using the following submission script: #!/bin/bash #SBATCH --partition=full #SBATCH --job-name="Large" source testenv1/bin/activate python3 multithread_example.py You're not asking for a number of cores, so you'll likely only be getting one
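A hedged rewrite of that submission script, assuming (for illustration) the threads should use 16 cores:

```shell
#!/bin/bash
#SBATCH --partition=full
#SBATCH --job-name="Large"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16   # actually request the cores the Python threads will use

source testenv1/bin/activate
python3 multithread_example.py
```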

Re: [slurm-users] Trouble installing slurm-20.02.4-1.amzn2.x86_64 libnvidia-ml.so.1

2020-12-04 Thread Christopher Samuel
Hi Drew, On 12/4/20 11:32 am, Mullen, Drew wrote: Error: Package: slurm-20.02.4-1.amzn2.x86_64 (/slurm-20.02.4-1.amzn2.x86_64) Requires: libnvidia-ml.so.1()(64bit) That looks like it's fixed in 20.02.5 (the current release is 20.02.6): ---

Re: [slurm-users] update_node / reason set to: slurm.conf / state set to DRAINED

2020-11-05 Thread Christopher Samuel
Hi Kevin, On 11/4/20 6:00 pm, Kevin Buckley wrote: In looking at the SlurmCtlD log we see pairs of lines as follows  update_node: node nid00245 reason set to: slurm.conf  update_node: node nid00245 state set to DRAINED I'd go looking in your healthcheck scripts; I took a quick look at the

Re: [slurm-users] Slurm Upgrade

2020-11-04 Thread Christopher Samuel
Hi Navin, On 11/4/20 10:14 pm, navin srivastava wrote: I have already built a new server slurm 20.2 with the latest DB. my question is,  shall i do a mysqldump into this server from existing server running with version slurm version 17.11.8 This won't work - you must upgrade your 17.11 datab
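A sketch of the stepwise path (slurmdbd can generally only convert a database from up to two major releases back, so 17.11 data needs an intermediate hop; the intermediate version and database name below are assumptions):

```shell
# 0. Always dump the accounting database first.
mysqldump slurm_acct_db > slurm_acct_db-17.11.sql
# 1. Install an intermediate slurmdbd (e.g. 19.05) and start it once;
#    it converts the 17.11 schema on first startup.
# 2. Then install the target slurmdbd (e.g. 20.02) and start it again
#    to finish the conversion. Watch the slurmdbd log at each step.
systemctl start slurmdbd
```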

Re: [slurm-users] Nodes not returning from DRAINING

2020-10-28 Thread Christopher Samuel
On 10/28/20 6:27 am, Diego Zuccato wrote: Strangely the core file seems corrupted (maybe because it's from a 4-nodes job and they all try to write to the same file?): You can set a pattern for core file names to prevent that, usually the PID is in the name, but you can put the hostname in the
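One way to set such a pattern (the path is an example only; writing it needs root):

```shell
# %e = executable, %h = hostname, %p = PID: ranks on different nodes
# then write distinct core files instead of clobbering one.
echo '/tmp/core.%e.%h.%p' | sudo tee /proc/sys/kernel/core_pattern
# Persistent variant: kernel.core_pattern=/tmp/core.%e.%h.%p in /etc/sysctl.d/
```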

Re: [slurm-users] pam_slurm_adopt always claims no active jobs even when they do

2020-10-23 Thread Christopher Samuel
Hi Paul, On 10/23/20 10:13 am, Paul Raines wrote: Any clues as to why pam_slurm_adopt thinks there is no job? Do you have PrologFlags=Contain in your slurm.conf? Contain At job allocation time, use the ProcTrack plugin to create a job container on all allocated compute nodes. This co
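The relevant slurm.conf line, for reference (a sketch, not the poster's actual config):

```ini
# Create the job container at allocation time so pam_slurm_adopt has
# something to adopt incoming SSH sessions into.
PrologFlags=Contain
```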

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-22 Thread Christopher Samuel
On 10/21/20 6:32 pm, Kevin Buckley wrote: If you install SLES 15 SP1 from the Q2 ISOs so that you have Munge but not the Slurm 18 that comes on the media, and then try to "rpmbuild -ta" against a vanilla Slurm 20.02.5 tarball, you should get the error I did. Ah, yes, that looks like it was a p

Re: [slurm-users] [External] Limit usage outside reservation

2020-10-22 Thread Christopher Samuel
On 10/22/20 12:20 pm, Burian, John wrote: This doesn't help you now, but Slurm 20.11 is expected to have "magnetic reservations," which are reservations that will adopt jobs that don't specify a reservation but otherwise meet the restrictions of the reservation: Magnetic reservations are in

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-20 Thread Christopher Samuel
On 10/20/20 12:49 am, Kevin Buckley wrote: only have, as listed before, Munge 0.5.13. I guess the question is (going back to your initial post): > error: Failed build dependencies: > munge-libs is needed by slurm-20.02.5-1.x86_64 Had you installed libmunge2 before trying this build?

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-19 Thread Christopher Samuel
On 10/19/20 7:15 pm, Kevin Buckley wrote: [...] Just out of interest though, when you built yours on CLE7.0 UP01, what provided the munge: the vanilla SLES munge, or a Cray munge? It's cray-munge for CLE7 UP01. Thanks for the explanation of what you've been running through! I forgot I do ha

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
Hi Sajesh, On 10/8/20 4:18 pm, Sajesh Singh wrote: Thank you for the tip. That works as expected. No worries, glad it's useful. Do be aware that the core bindings for the GPUs would likely need to be adjusted for your hardware! Best of luck, Chris -- Chris Samuel : http://www.csamuel

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
On 10/8/20 3:48 pm, Sajesh Singh wrote: Thank you. Looks like the fix is indeed the missing file /etc/slurm/cgroup_allowed_devices_file.conf No, you don't want that, that will allow all access to GPUs whether people have requested them or not. What you want is in gres.conf and looks lik
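A gres.conf along those lines might look like this (the GPU type, device paths and core ranges are invented and must match the real hardware):

```ini
# One line per GPU device; Cores= binds each GPU to its local CPU socket.
Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-19
Name=gpu Type=v100 File=/dev/nvidia1 Cores=20-39
```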

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
Hi Sajesh, On 10/8/20 11:57 am, Sajesh Singh wrote: debug:  common_gres_set_env: unable to set env vars, no device files configured I suspect the clue is here - what does your gres.conf look like? Does it list the devices in /dev for the GPUs? All the best, Chris -- Chris Samuel : http:/

Re: [slurm-users] Current status of checkpointing

2020-08-14 Thread Christopher Samuel
On 8/14/20 6:17 am, Stefan Staeglich wrote: what's the current status of the checkpointing support in SLURM? There isn't any these days, there used to be support for BLCR but that's been dropped as BLCR is no more. I know from talking with SchedMD they are of the opinion that any current c

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Christopher Samuel
On 8/6/20 10:13 am, Jason Simms wrote: Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into
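For the reservation route, a sketch of the scontrol invocation (the name and times are placeholders):

```shell
# System-wide maintenance window; ignore_jobs lets it overlap running jobs.
scontrol create reservation reservationname=maint_aug2020 \
    starttime=2020-08-24T08:00:00 duration=08:00:00 \
    flags=maint,ignore_jobs nodes=ALL users=root
```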

Re: [slurm-users] cgroup limits not created for jobs

2020-07-26 Thread Christopher Samuel
On 7/26/20 12:21 pm, Paul Raines wrote: Thank you so much.  This also explains my GPU CUDA_VISIBLE_DEVICES missing problem in my previous post. I've missed that, but yes, that would do it. As a new SLURM admin, I am a bit surprised at this default behavior. Seems like a way for users to game
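The enforcement being discussed comes from cgroup.conf; a minimal sketch (assuming TaskPlugin=task/cgroup is set in slurm.conf):

```ini
# Confine each job to the cores, memory and devices it actually requested,
# so users cannot simply export CUDA_VISIBLE_DEVICES themselves.
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
```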
