[slurm-users] slurmrestd via unix socket

2025-04-10 Thread Brian Andrus via slurm-users
8,   "mode": "backup"     } Other commands fail with:   "error_number": 1007,   "error": "Protocol authentication error", I'll admit, I don't usually use sockets, so I could easily be overlooking something there. Permissions on the socket look right. I am getting json back, so it is connecting. Note: slurmrestd is running under it's own user (not root and not slurmuser). Any ideas? Thanks in advance, Brian Andrus -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Reply: Re: how to set slurmdbd.conf if using two slurmdb nodes with HA database?

2025-02-20 Thread Brian Andrus via slurm-users
, this generally gives ample time to recover without issue. Brian Andrus On 2/20/2025 6:45 PM, hermes via slurm-users wrote: Thank you for your insightful suggestions. Placing both slurmdbd and slurmctld on the same node is indeed a new structure  that we hadn’t considered before, and it

[slurm-users] Re: How to clean up?

2025-02-04 Thread Brian Andrus via slurm-users
emons are down, then start the first. Once it is up (you can run scontrol show config) start the second. Run 'scontrol show config' again and you should see both daemons listed as 'up' at the end of the output. -Brian Andrus On 2/3/2025 7:29 PM, Steven Jones via slurm-users wrot

[slurm-users] Re: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions

2025-01-04 Thread Brian Andrus via slurm-users
ts to see what it is asking for that does not exist 'scontrol show job xxx' Brian Andrus On 1/4/2025 3:41 AM, John Hearns via slurm-users wrote: Output of sinfo and squeue Look at slurmd log in an example node also Tail -f is your friend On Sat, Jan 4, 2025, 8:13 AM sportlecon spor

[slurm-users] Re: All GPUs are Usable if no Gres is Defined

2025-01-04 Thread Brian Andrus via slurm-users
Ensure cgroups is working and configured to limit access to devices (which includes gpus). Check your cgroup.conf to see that there is an entry for:     ConstrainDevices=yes Brian Andrus On 1/3/2025 10:49 AM, Groner, Rob via slurm-users wrote: I'm not entirely sure, and I can't
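A quick way to check that setting from a shell (a sketch; the /etc/slurm/cgroup.conf path is an assumption and varies by install prefix):

```shell
# Sketch: confirm cgroup.conf constrains device access. GPUs are only
# fenced off per-job when ConstrainDevices=yes and the cgroup plugin is
# active; adjust CGROUP_CONF to your install.
CGROUP_CONF="${CGROUP_CONF:-/etc/slurm/cgroup.conf}"
if [ -f "$CGROUP_CONF" ] && \
   grep -Eiq '^[[:space:]]*ConstrainDevices[[:space:]]*=[[:space:]]*yes' "$CGROUP_CONF"; then
    echo "ConstrainDevices is enabled"
else
    echo "WARNING: ConstrainDevices not set to yes in $CGROUP_CONF" >&2
fi
```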

[slurm-users] multiple conf-server entries for sackd

2024-12-03 Thread Brian Andrus via slurm-users
light about that. Brian Andrus -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: sinfo not listing any partitions

2024-12-02 Thread Brian Andrus via slurm-users
You only have one partition named 'default' You are not allowed to name it that. Name it something else and you should be good. Brian Andrus On 11/28/2024 6:52 AM, Patrick Begou via slurm-users wrote: Hi Kent, on your management node could you run: systemctl status slurmctld and

[slurm-users] Re: Change primary alloc node

2024-11-03 Thread Brian Andrus via slurm-users
cket, numa, board, and node. Brian Andrus On 11/3/2024 12:06 AM, Bhaskar Chakraborty wrote: Hi Brian, Thanks for the response! However, this particular approach where we need to accept whatever slurm gives us as starting node and deal with it accordingly doesn’t work for us. I think there

[slurm-users] Re: Change primary alloc node

2024-10-31 Thread Brian Andrus via slurm-users
stuff here *)     Run all other stuff here esac Takes some coding effort but keeps control of the processes within your own code. Brian Andrus On 10/30/24 09:35, Bhaskar Chakraborty via slurm-users wrote: Hi, Is there a way to change/control the primary node (i.e. where the initial task start

[slurm-users] Re: How do you guys track which GPU is used by which job ?

2024-10-16 Thread Brian Andrus via slurm-users
files that map GPUs to HPC jobs./ It does go on to show the conventions/format of the files. I imagine you could have some bits in a prologue script that creates that as the job starts on the node and point dcgm-exporter there. Brian Andrus On 10/16/24 06:10, Sylvain MARET via slurm-users

[slurm-users] Re: what updates NODEADDR

2024-09-21 Thread Brian Andrus via slurm-users
IIRC, you need to ensure reverse lookup for DNS matches your nodename Brian Andrus On 9/20/2024 4:55 PM, Jakub Szarlat via slurm-users wrote: Hi I'm using dynamic nodes with "slurmd -Z" with slurm 23.11.1. Firstly I find that when you do "scontrol show node" it
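The forward/reverse agreement can be checked with a short sketch like the following ("node01" is a placeholder NodeName, not from the thread):

```shell
# Sketch: verify forward and reverse DNS agree for a node name, since a
# mismatched PTR record can break slurmd registration. Pass the real
# NodeName as the first argument.
NODE="${1:-node01}"
ADDR=$(getent hosts "$NODE" | awk '{print $1; exit}')
if [ -z "$ADDR" ]; then
    echo "no forward record for $NODE" >&2
else
    REV=$(getent hosts "$ADDR" | awk '{print $2; exit}')
    echo "forward: $NODE -> $ADDR, reverse: $ADDR -> $REV"
    [ "$REV" = "$NODE" ] || echo "WARNING: reverse lookup mismatch" >&2
fi
```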

[slurm-users] Re: salloc not starting shell despite LaunchParameters=use_interactive_step

2024-09-06 Thread Brian Andrus via slurm-users
Folks have addressed the obvious config settings, but also check your prolog/epilog scripts/settings as well as the .bashrc/.bash_profile and stuff in /etc/profile.d/ That may be hanging it up. Brian Andrus On 9/5/2024 5:17 AM, Loris Bennett via slurm-users wrote: Hi, With $ salloc

[slurm-users] Re: Bug? sbatch not respecting MaxMemPerNode setting

2024-09-04 Thread Brian Andrus via slurm-users
others did nothing. Brian Andrus On 9/4/2024 1:37 AM, Angel de Vicente via slurm-users wrote: Hello, we found an issue with Slurm 24.05.1 and the MaxMemPerNode setting. Slurm is installed in a single workstation, and thus, the number of nodes is just 1. The relevant sections in slurm.conf read

[slurm-users] Re: playing with --nodes=

2024-08-30 Thread Brian Andrus via slurm-users
They are more than happy to do that. Brian Andrus On 8/29/2024 11:48 PM, Matteo Guglielmi via slurm-users wrote: I'm sorry, but I still don't get it. Isn't --nodes=2,4 telling slurm to allocate 2 OR 4 nodes and nothing else? So, if: --nodes=2 allocates only two nodes

[slurm-users] Re: playing with --nodes=

2024-08-29 Thread Brian Andrus via slurm-users
. Slurm does not give you 4 nodes because you only want 3 tasks. You see the result in your variables: SLURM_NNODES=3 SLURM_JOB_CPUS_PER_NODE=1(x3) If you only want 2 nodes, make --nodes=2. Brian Andrus On 8/29/24 08:00, Matteo Guglielmi via slurm-users wrote: Hi, On sbatch's manpage

[slurm-users] Re: playing with --nodes=

2024-08-29 Thread Brian Andrus via slurm-users
logs and check your conf see what your defaults are. Brian Andrus On 8/29/2024 5:04 AM, Matteo Guglielmi via slurm-users wrote: Hello, I have a cluster with four Intel nodes (node[01-04], Feature=intel) and four Amd nodes (node[05-08], Feature=amd). # job file #SBATCH --ntasks=3 #SBATCH

[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Brian Andrus via slurm-users
d, which would help. If they are all exiting with exit code 9, you need to look at the code for your a.out to see what code 9 means, as that is who is raising that error. Slurm sees that and if it is non-zero, it interprets it as a failed job. Brian Andrus On 8/19/2024 12:50 AM, Arko Roy v

[slurm-users] Re: Upgrade compute node to 24.05.2

2024-08-15 Thread Brian Andrus via slurm-users
and ensure slurmd is happier. Brian Andrus On 8/14/24 17:52, Sid Young via slurm-users wrote: G'Day all, I've been upgrading my cluster from 20.11.0 in small steps to get to 24.05.2. Currently I have all nodes on 23.02.8, the controller on 24.05.2 and a single test node on 24.05.

[slurm-users] Re: Find out submit host of past job?

2024-08-07 Thread Brian Andrus via slurm-users
If you need it, you could add it to either prologue or epilogue to store the info somewhere. I do that for the scripts themselves and keep the past two weeks backed up so we can debug if/when there is an issue. Brian Andrus On 8/7/2024 6:29 AM, Steffen Grunewald via slurm-users wrote: On
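An epilog fragment in that spirit might look like the sketch below. The log directory, filename layout, and two-week retention are assumptions; in production point LOGDIR at persistent storage (e.g. under /var/log/slurm), and note that SLURM_SUBMIT_HOST availability varies by prolog/epilog type, hence the fallback value.

```shell
# Sketch: record each job's submit host from an epilog, pruning after two
# weeks as described above. Defaults to a tmp dir so the sketch runs
# anywhere; LOGDIR and the variables' presence are assumptions.
LOGDIR="${LOGDIR:-${TMPDIR:-/tmp}/slurm-job-info}"
mkdir -p "$LOGDIR"
printf '%s %s\n' "$(date +%F_%T)" "${SLURM_SUBMIT_HOST:-unknown}" \
    > "$LOGDIR/${SLURM_JOB_ID:-unknown}.submit_host"
# Keep only the past two weeks of records.
find "$LOGDIR" -name '*.submit_host' -mtime +14 -delete
```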

[slurm-users] Re: LRMS error: (-1) Job missing from SLURM."

2024-08-06 Thread Brian Andrus via slurm-users
to be from a front-end system that interfaces with slurm and does not seem to show the actual slurm jobid, unless those are the 274398, 274399, and 274400 numbers. If so, you could look in the slurmctld logs for the jobs to see what may have happened. Brian Andrus On 8/6/2024 5:57 AM, Felix via

[slurm-users] Re: Background tasks in Slurm scripts?

2024-07-26 Thread Brian Andrus via slurm-users
Generally speaking, when the batch script exits, slurm will clean up (ie kill) any stray processes. So, I would expect that executable to be killed at cleanup. Brian Andrus On 7/26/2024 2:45 AM, Steffen Grunewald via slurm-users wrote: On Fri, 2024-07-26 at 10:42:45 +0300, Slurm users wrote

[slurm-users] Re: CLOUD nodes with unknown IP addresses

2024-07-19 Thread Brian Andrus via slurm-users
Martin, In a nutshell, when slurmd starts, it tells that info to slurmctld. That is the "registration" event mentioned. Brian Andrus On 7/19/2024 5:44 AM, Martin Lee via slurm-users wrote: I've read the following in the slurm power saving docs: https://slurm.schedmd.com/

[slurm-users] Re: SLURM noob administrator question

2024-07-11 Thread Brian Andrus via slurm-users
You probably want to look at scontrol show node and scontrol show job for that node and the jobs on it. Compare those and you may find someone requested most all the resources, but are not running them properly. Look at the job itself to see what it is trying to do. Brian Andrus On 7/11

[slurm-users] Re: Nodes TRES double what is requested

2024-07-10 Thread Brian Andrus via slurm-users
Jack, To make sure things are set right, run 'slurmd -C' on the node and use that output in your config. It can also give you insight as to what is being seen on the node versus what you may expect. Brian Andrus On 7/10/2024 1:25 AM, jack.mellor--- via slurm-users wrote: H
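'slurmd -C' prints the node's detected hardware as a ready-to-paste slurm.conf line; a sketch of pulling out the fields that most often disagree with the config (the sample line is illustrative only, replace it with "$(slurmd -C | head -1)" on a real node):

```shell
# Sketch: extract CPUs= and RealMemory= from a slurmd -C style line so
# they can be diffed against slurm.conf. SAMPLE is a made-up node, not
# real output.
SAMPLE='NodeName=node01 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=257000'
for kv in $SAMPLE; do
    case $kv in
        CPUs=*|RealMemory=*) echo "$kv" ;;
    esac
done
```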

[slurm-users] Re: Using sharding

2024-07-04 Thread Brian Andrus via slurm-users
Just a thought. Try specifying some memory. It looks like the running jobs do that and by default, if not specified it is "all the memory on the node", so it can't start because some of it is taken. Brian Andrus On 7/4/2024 9:54 AM, Ricardo Cruz wrote: Dear Brian, Curre

[slurm-users] Re: Using sharding

2024-07-04 Thread Brian Andrus via slurm-users
at that moment. Brian Andrus On 7/4/2024 8:43 AM, Ricardo Cruz via slurm-users wrote: Greetings, There are not many questions regarding GPU sharding here, and I am unsure if I am using it correctly... I have configured it according to the instructions <https://slurm.schedmd.com/gres.html>

[slurm-users] Re: How can I tell the OS that was used to build SLURM?

2024-06-20 Thread Brian Andrus via slurm-users
going to run in. Because there are multiple possible dependencies/uses, this is best. Brian Andrus On 6/20/2024 1:38 PM, Carl Ponder via slurm-users wrote: We're seeing SLURM mis-behaving on one of our clusters, that runs Ubuntu 22.04. Among other problems, we see an error-me

[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-20 Thread Brian Andrus via slurm-users
Well, if I am reading this right, it makes sense. Every job will need at least 1 core just to run and if there are only 4 cores on the machine, one would expect a max of 4 jobs to run. Brian Andrus On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote: I have a machine with a quad-core CPU and

[slurm-users] Re: slurmdbd not connecting to mysql (mariadb)

2024-05-30 Thread Brian Andrus via slurm-users
That SIGTERM message means something is telling slurmdbd to quit. Check your cron jobs, maintenance scripts, etc. Slurmdbd is being told to shutdown. If you are running in the foreground, a ^C does that. If you run a kill or killall on it, you will get that same message. Brian Andrus On 5

[slurm-users] Re: slurmdbd archive format

2024-05-28 Thread Brian Andrus via slurm-users
Oh, to address the passed train: Restore the archive data with "sacctmgr archive load", then you can do as you need. From man sacctmgr: *archive*{dump|load}     Write database information to a flat file or load information that has previously been written to a file. Brian Andr

[slurm-users] Re: slurmdbd archive format

2024-05-28 Thread Brian Andrus via slurm-users
Instead of using the archive files, couldn't you query the db directly for the info you need? I would recommend sacct/sreport if those can get the info you need. Brian Andrus On 5/28/2024 9:59 AM, O'Neal, Doug (NIH/NCI) [C] via slurm-users wrote: My organization needs to access hi

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-23 Thread Brian Andrus via slurm-users
I would guess either you install GPU drivers on the non-GPU nodes or build slurm without GPU support for that to work due to package dependencies. Both viable options. I have done installs where we just don't compile GPU support in and that is left to the users to manage. Brian Andrus

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-22 Thread Brian Andrus via slurm-users
versions are compatible, they can work together. You will need to be aware of differences for jobs and configs, but it is possible. Brian Andrus On 5/22/2024 12:45 AM, Arnuld via slurm-users wrote: We have several nodes, most of which have different Linux distributions (distro for short

[slurm-users] Re: Submitting from an untrusted node

2024-05-14 Thread Brian Andrus via slurm-users
Rike, Assuming the data, scripts and other dependencies are already on the cluster, you could just ssh and execute the sbatch command in a single shot: ssh submitnode sbatch some_script.sh It will ask for a password if appropriate and could use ssh keys to bypass that need. Brian Andrus

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
added an override file, that will affect things. Brian Andrus On 4/19/2024 10:15 AM, Jeffrey Layton wrote: I like it, however, it was working before without a slurm.conf in /etc/slurm. Plus the environment variable SLURM_CONF is pointing to the correct slurm.conf file (the one in /cm

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
nf in /etc/slurm/ on the node(s). Brian Andrus On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote: Good afternoon, I'm working on a cluster of NVIDIA DGX A100's that is using BCM 10 (Base Command Manager which is based on Bright Cluster Manager). I ran into an error and only jus

[slurm-users] Re: Slurm.conf and workers

2024-04-15 Thread Brian Andrus via slurm-users
will want to sync the config across all nodes and then 'scontrol reconfigure' You may want to look into configless if you can set DNS entries and your config is basically monolithic or all parts are in /etc/slurm/ Brian Andrus On 4/15/2024 2:55 AM, Xaver Stiensmeier via slurm-use

[slurm-users] Re: Upgrading nodes

2024-04-10 Thread Brian Andrus via slurm-users
Yes. You can build the 8 rpms on 9. Look at 'mock' to do so. I did similar when I still had to support EL7 Fairly generic plan, the devil is in the details and verifying each step, but those are the basic bases you need to touch. Brian Andrus On 4/10/2024 1:48 PM, Steve Berg

[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-08 Thread Brian Andrus via slurm-users
path to look at. Brian Andrus On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote: Dear slurm user list, we make use of elastic cloud computing i.e. node instances are created on demand and are destroyed when they are not used for a certain amount of time. Created instances are set up

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
name it has in the slurm.conf file. Also, a quick way to do the failover check is to run (from the backup controller): scontrol takeover Brian Andrus On 3/25/2024 1:39 PM, Miriam Olmi wrote: Hi Brian, Thanks for replying. In my first message I forgot to specify that the primary and the

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Quick correction, it is StateSaveLocation not SlurmSaveState. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am having trouble finalizing the configuration of the backup controller for my slurm cluster. In principle, if no job is running everything seems

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Miriam, You need to ensure the SlurmSaveState directory is the same for both. And by 'the same', I mean all contents are exactly the same. This is usually achieved by using a shared drive or replication. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear
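A minimal slurm.conf sketch of that arrangement (hostnames and the shared path are placeholders; the actual parameter name is StateSaveLocation):

```
# Both controllers must see identical state; put StateSaveLocation on
# storage shared between them (NFS, DRBD, or similar).
SlurmctldHost=ctl-primary
SlurmctldHost=ctl-backup
StateSaveLocation=/shared/slurm/state
```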

[slurm-users] Re: We're Live! Check out the new SchedMD.com now!

2024-03-13 Thread Brian Andrus via slurm-users
Wow, snazzy! Looks very good. My compliments. Brian Andrus On 3/12/2024 11:24 AM, Victoria Hobson via slurm-users wrote: Our website has gone through some much needed change and we'd love for you to explore it! The new SchedMD.com is equipped with the latest information about Slurm,

[slurm-users] Re: Slurm billback and sreport

2024-03-04 Thread Brian Andrus via slurm-users
Chip, I use 'sacct' rather than sreport and get individual job data. That is ingested into a db and PowerBI, which can then aggregate as needed. sreport is pretty general and likely not the best for accurate chargeback data. Brian Andrus On 3/4/2024 6:09 AM, Chip Seraphine via s

[slurm-users] Re: Is SWAP memory mandatory for SLURM

2024-03-04 Thread Brian Andrus via slurm-users
Joseph, You will likely get many perspectives on this. I disable swap completely on our compute nodes. I can be draconian that way. For the workflow supported, this works and is a good thing. Other workflows may benefit from swap. Brian Andrus On 3/3/2024 11:04 PM, John Joseph via slurm

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
oxy> Brian Andrus On 2/28/2024 12:54 PM, Dan Healy wrote: Are most of us using HAProxy or something else? On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users wrote: Magnus, That is a feature of the load balancer. Most of them have that these days. Brian Andrus

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
Magnus, That is a feature of the load balancer. Most of them have that these days. Brian Andrus On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote: On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote: for us, we put a load balancer in front of the

[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-27 Thread Brian Andrus via slurm-users
disconnection for any reason even for X-based apps. Personally, I don't care much for interactive sessions in HPC, but there is a large body that only knows how to do things that way, so it is there. Brian Andrus On 2/26/2024 12:27 AM, Josef Dvoracek via slurm-users wrote: What is the recommende

[slurm-users] Re: [INTERNET] Re: question on sbatch --prefer

2024-02-10 Thread Brian Andrus via slurm-users
I imagine you could create a reservation for the node and then when you are completely done, remove the reservation. Each helper could then target the reservation for the job. Brian Andrus On 2/9/2024 5:52 PM, Alan Stange via slurm-users wrote: Chip, Thank you for your prompt response.  We

Re: [slurm-users] sinfo: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host

2024-01-26 Thread Brian Andrus
look elsewhere. Brian Andrus On 1/26/2024 6:38 AM, Michael Lewis wrote: Hi All, I’m trying to get slurm-23.11.3 running on Ubuntu 20.04 and running on a stand alone system.  I’m running into an issue I can not find the answer to.  After compiling and installing when I fire up slurmctld

Re: [slurm-users] Suspend/Resume request limit

2024-01-17 Thread Brian Andrus
While I am not sure of your specifics, you could easily add lines to your suspend/resume scripts to check/wait/etc if there are tasks waiting. Brian Andrus On 1/15/2024 12:22 AM, 김종록 wrote: Hello. I'm going to use Slurm's cloud feature in private cloud. The problem is that the

Re: [slurm-users] install new slurm, no slurmctld found

2023-12-16 Thread Brian Andrus
a submit/login node. Brian Andrus On 12/15/2023 2:00 AM, Felix wrote: Hello we are installing a new server with slurm on ALMA Linux 9.2 we did the following: dnf install slurm The result is rpm -qa | grep slurm slurm-libs-22.05.9-1.el9.x86_64 slurm-22.05.9-1.el9.x86_64 Now when trying to

Re: [slurm-users] SlurmdSpoolDir full

2023-12-09 Thread Brian Andrus
filled on the node. You can run 'df -h' and see some info that would get you started. Brian Andrus On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote: Dear slurm-user list, during a larger cluster run (the same I mentioned earlier 242 nodes), I got the error "SlurmdSpoolDir full". T
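A sketch of checking that filesystem from a shell (the /var/spool/slurmd default is an assumption; read the real SlurmdSpoolDir from 'scontrol show config', and the /tmp fallback just lets the sketch run anywhere):

```shell
# Sketch: report how full the filesystem holding SlurmdSpoolDir is and
# warn past a 90% threshold (threshold is an arbitrary choice).
SPOOL="${SPOOL:-/var/spool/slurmd}"
[ -d "$SPOOL" ] || SPOOL=/tmp
USE=$(df -P "$SPOOL" | awk 'NR==2 {gsub("%","",$5); print $5}')
echo "filesystem holding $SPOOL is ${USE}% full"
[ "$USE" -lt 90 ] || echo "WARNING: consider cleaning $SPOOL" >&2
```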

Re: [slurm-users] slurm power save question

2023-11-29 Thread Brian Andrus
nality of being able to have "keep at least X nodes up and idle" would be nice, that is not how I see this documented or working. Brian Andrus On 11/23/2023 5:12 AM, Davide DelVento wrote: Thanks for confirming, Brian. That was my understanding as well. Do you have it working that

Re: [slurm-users] slurm power save question

2023-11-22 Thread Brian Andrus
As I understand it, that setting means "Always have at least X nodes up", which includes running jobs. So it stops any wait time for the first X jobs being submitted, but any jobs after that will need to wait for the power_up sequence. Brian Andrus On 11/22/2023 6:58 AM, David

Re: [slurm-users] partition qos without managing users

2023-11-22 Thread Brian Andrus
Eg, Could you be more specific as to what you want? Is there a specific user you want to control, or no user should get more than x cpus in the partition? Or no single job should get more than x cpus? The details matter to determine the right approach and right settings. Brian Andrus On 11

Re: [slurm-users] partition qos without managing users

2023-11-20 Thread Brian Andrus
your slurm users belong to and add them to slurmdbd. Once they are in there, you can set defaults with exceptions for specific users. If you are only looking to have settings apply to all users, you don't have to import the users. Set the QoS for the partition. Brian Andrus On 11/20/2023 1:

Re: [slurm-users] slurm job_container/tmpfs

2023-11-20 Thread Brian Andrus
How do you 'manually create a directory'? That would be when the ownership of root would be occurring. After creating it, you can chown/chmod it as well. Brian Andrus On 11/18/2023 7:35 AM, Arsene Marian Alain wrote: Dear slurm community, I run slurm 21.08.1 under Rocky Linux

Re: [slurm-users] Slurm Rest API error

2023-06-28 Thread Brian Andrus
Vlad, Actually, it looks like it is working. You are using v0.39 for the parser, which is trying to use OpenAPI calls. Unless you compiled with OpenAPI, that won't work. Try using the 0.37 version and you may see a simpler result that is successful. Brian Andrus On 6/28/2023 11:

Re: [slurm-users] Backfill Scheduling

2023-06-26 Thread Brian Andrus
can squeeze in before the additional node for Job B is expected to be available, so it runs on the idle node. Brian Andrus On 6/26/2023 3:48 PM, Reed Dier wrote: Hoping this will be an easy one for the community. The priority schema was recently reworked for our cluster, with only Priori

Re: [slurm-users] federation vs multi-cluster

2023-06-26 Thread Brian Andrus
ineate what node can do what (a node-locked license, for example). Then you can send a job to a specific subset of nodes. Quite a few other ways to design the ability you describe, but separate clusters is not one of them. Brian Andrus On 6/26/2023 6:11 AM, mohammed shambakey wrote: Hi Jus

Re: [slurm-users] monitoring and accounting

2023-06-12 Thread Brian Andrus
Second that. Prometheus+slurm exporter+grafana works great. Brian Andrus On 6/12/2023 8:20 AM, Josef Dvoracek wrote: > But I'd be interested to see what other places do. we installed this: https://github.com/vpenso/prometheus-slurm-exporter and scrape this exporter with "inpu

Re: [slurm-users] Can't get --reboot to work at all with slurm-23.02?

2023-06-07 Thread Brian Andrus
Make sure you have configured the RebootProgram in slurm.conf, that it exists on the nodes and is executable by the user. This is usually /sbin/reboot Brian Andrus On 6/7/2023 7:50 AM, Heinz, Michael wrote: Hey, all. So I added slurmdbd to our slurm-23.02 install and made my account an
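For reference, a minimal slurm.conf sketch of that setting (the path is the usual default mentioned in the message; verify it exists on your nodes):

```
# RebootProgram must exist on every node and be executable by the user
# slurmd runs as (normally root).
RebootProgram=/sbin/reboot
```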

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Brian Andrus
That output of slurmd -C is your answer. Slurmd only sees 6GB of memory and you are claiming it has 10GB. I would run some memtests, look at meminfo on the node, etc. Maybe even check that the type/size of memory in there is what you think it is. Brian Andrus On 5/25/2023 7:30 AM, Roger

Re: [slurm-users] Slurm 22.05.8 - salloc not starting shell on remote host

2023-05-19 Thread Brian Andrus
Defaulting to a shell for salloc is a newer feature. For your version, you should:     srun -n 1 -t 00:10:00 --mem=1G --pty bash Brian Andrus On 5/19/2023 8:24 AM, Ryan Novosielski wrote: I’m not at a computer, and we run an older version of Slurm yet so I can’t say with 100% confidence that

Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Brian Andrus
jobs are running on. Brian Andrus On 5/17/2023 10:49 AM, Groner, Rob wrote: I'm not sure what you mean by "if they have the permissions". I'm talking about someone who is specifically designated as "coordinator" of an account in slurm.  With that designation

Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Brian Andrus
you need to preempt running jobs, that would take a bit more effort to set up, but is an alternative. Brian Andrus On 5/17/2023 6:40 AM, Groner, Rob wrote: I was asked to see if coordinators could do anything in this scenario: * Within the account that they coordinated, User A submitted 1000s

Re: [slurm-users] Prevent CLOUD node from being shutdown after startup

2023-05-12 Thread Brian Andrus
ut counts in a comma separated list (e.g "nid[10-20]:4,nid[80-90]:2"). By default no nodes are excluded. This value may be updated with scontrol. See ReconfigFlags=KeepPowerSaveSettings for setting persistence. Brian Andrus On 5/12/2023 2:35 AM, Xaver Stiensmeier wrote: D

Re: [slurm-users] monitoring and accounting

2023-05-05 Thread Brian Andrus
Something I have been impressed with is Netdata It is in the standard repositories and will auto-detect quite a bit of things on a node. It is great for real-time monitoring of a node/job. I also use Prometheus and Grafana for historic data (anything over 5 minutes). Brian Andrus On 5/5

Re: [slurm-users] Several slurmdbds against one mysql server?

2023-04-30 Thread Brian Andrus
ente wrote: Hello, Brian Andrus writes: Ole is spot on with his federated suggestion. That is exactly what fits the bill for you, given your requirements. You can have everything you want, but you don't get to have it how you want (separate databases). When/If you looked deeper into it, you wi

Re: [slurm-users] Several slurmdbds against one mysql server?

2023-04-30 Thread Brian Andrus
different part of the world and trying to federate them in a performant manner was prohibitively expensive. Brian Andrus On 4/29/2023 10:53 PM, Angel de Vicente wrote: Hi Ole, Ole Holm Nielsen writes: Maybe you want to use Slurm federated clusters with a single database thanks for

Re: [slurm-users] Slurmdbd High Availability

2023-04-13 Thread Brian Andrus
the HA database. One would be primary and the other a failover (AccountingStorageBackupHost). Although, technically, they would both be able to be active at the same time. Brian Andrus On 4/13/2023 2:49 AM, Shaghuf Rahman wrote: Hi, I am setting up Slurmdb in my system and I need some inputs
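A slurm.conf sketch of that failover pairing (hostnames are placeholders; both slurmdbd instances point at the same HA database):

```
# Primary and backup slurmdbd hosts for accounting.
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd-primary
AccountingStorageBackupHost=dbd-backup
```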

Re: [slurm-users] Odd prolog Error?

2023-04-11 Thread Brian Andrus
user exists on the node, however you are propagating the uids. Brian Andrus On 4/11/2023 9:48 AM, Jason Simms wrote: Hello all, Regularly I'm seeing array jobs fail, and the only log info from the compute node is this: [2023-04-11T11:41:12.336] error: /opt/slurm/prolog.sh: exited

Re: [slurm-users] Slurmd enabled crash with CgroupV2

2023-03-10 Thread Brian Andrus
few things: [Unit] After=autofs.service getty.target sssd.service That makes it wait for all of those before trying to start. Brian Andrus On 3/10/2023 7:41 AM, Tristan LEFEBVRE wrote: Hello to all, I'm trying to do an installation of Slurm with cgroupv2 activated. But I'm facing

Re: [slurm-users] Cleanup of job_container/tmpfs

2023-03-06 Thread Brian Andrus
the node ensure the shared filesystems are mounted before allowing jobs. -Brian Andrus On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote: Hi all Seems there still are some issues with the autofs - job_container/tmpfs functionality in Slurm 23.02. If the required directories aren't mounted o

Re: [slurm-users] Power saving and node weight

2023-03-01 Thread Brian Andrus
do as well. I would be interested in any alternatives. Could you point me to some doc? Best wishes Gizo Brian Andrus On 2/28/2023 7:44 AM, Gizo Nanava wrote: Hello, it seems that if a slurm power saving is enabled then the parameter "Weight" seem to be ignored for nodes tha

Re: [slurm-users] Chaining srun commands

2023-02-28 Thread Brian Andrus
get (resource-wise) and how do you want to use them? Brian Andrus On 2/28/2023 9:49 AM, Jake Jellinek wrote: Hi all I come from a SGE/UGE background and am used to the convention that I can qrsh to a node and, from there, start a new qrsh to a different node with different parameters. I'

Re: [slurm-users] Power saving and node weight

2023-02-28 Thread Brian Andrus
You may be able to use the alternate approach that I was able to do as well. Brian Andrus On 2/28/2023 7:44 AM, Gizo Nanava wrote: Hello, it seems that if a slurm power saving is enabled then the parameter "Weight" seem to be ignored for nodes that are in a power down state. Is

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Brian Andrus
most jobs. Perhaps there is some additional lines that could be added to the job that would do a call to a snakemake API and report itself? Or maybe such an API could be created/expanded. Just a quick 2 cents (We may be up to a few dollars with all of those so far). Brian Andrus On 2/27/202

Re: [slurm-users] GPUs not available after making use of all threads?

2023-02-14 Thread Brian Andrus
formance answer lies in how any of the processes work, which is why some of us do so many experimental runs of jobs and gather timings. We have yet to see a 100% efficient process, but folks are improving things all the time. Brian Andrus On 2/13/2023 9:56 PM, Diego Zuccato wrote: I think tha

Re: [slurm-users] GPUs not available after making use of all threads?

2023-02-13 Thread Brian Andrus
efficient HPC jobs. The goal is that every process is utilizing the CPU as close to 100% as possible, which would render hyper-threading moot. Brian Andrus On 2/13/2023 12:15 AM, Hermann Schwärzler wrote: Hi Sebastian, I am glad I could help (although not exactly as expected :-). With

Re: [slurm-users] slurm and singularity

2023-02-08 Thread Brian Andrus
commands are xterm, a shell script containing srun commands, and srun (see the EXAMPLES section). *If no command is specified, then salloc runs the user's default shell.* Brian Andrus On 2/8/2023 7:01 AM, Jeffrey T Frey wrote: You may need srun to allocate a pty for the command.

Re: [slurm-users] slurm and singularity

2023-02-07 Thread Brian Andrus
Then cluster_run.sh would call sbatch along with the appropriate commands. Brian Andrus On 2/7/2023 9:31 AM, Groner, Rob wrote: I'm trying to setup the capability where a user can execute: $: sbatch script_to_run.sh and the end result is that a job is created on a node, and that job wi

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-17 Thread Brian Andrus
y with the new (known good) config. Brian Andrus On 1/17/2023 12:36 PM, Groner, Rob wrote: So, you have two equal sized clusters, one for test and one for production?  Our test cluster is a small handful of machines compared to our production. We have a test slurm control node on a test cl

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-04 Thread Brian Andrus
ready. Brian Andrus On 1/4/2023 9:22 AM, Groner, Rob wrote: We currently have a test cluster and a production cluster, all on the same network.  We try things on the test cluster, and then we gather those changes and make a change to the production cluster.  We're doing that through two diffe

Re: [slurm-users] slurmrestd service broken by 22.05.07 update

2022-12-29 Thread Brian Andrus
lurm/slurm.conf" You can change those as needed. This made it listen on port 8081 only (no socket and not port 6820). I was then able to just use curl on port 8081 to test things. Hope that helps. Brian Andrus On 12/29/2022 6:49 AM, Chris Stackpole wrote: Greetings, Thanks for responding
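For reference, the slurmrestd unit can be pointed at a TCP port with a systemd drop-in; this is a sketch assuming the stock unit path and the port 8081 mentioned in the message:

```
# /etc/systemd/system/slurmrestd.service.d/override.conf
[Service]
# Clear the packaged ExecStart, then listen on TCP 8081 only (no unix socket)
ExecStart=
ExecStart=/usr/sbin/slurmrestd 0.0.0.0:8081
```

A quick check (the API version segment below is an assumption and varies by Slurm release): `curl -s -H "X-SLURM-USER-NAME: $USER" -H "X-SLURM-USER-TOKEN: $SLURM_JWT" http://localhost:8081/slurm/v0.0.38/ping`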

Re: [slurm-users] slurmrestd service broken by 22.05.07 update

2022-12-28 Thread Brian Andrus
I suspect if you delete /var/lib/slurmrestd.socket and then start slurmrestd, it will create it as the user you need it to be. Or just change the owner of it to the slurmrestd owner. I have been running slurmrestd as a separate user for some time. Brian Andrus On 12/28/2022 3:20 PM, Chris
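A sketch of the two fixes described, using the socket path from the thread (the `slurmrestd` account name is an assumption; substitute whatever user the service runs as):

```
systemctl stop slurmrestd
# Option 1: remove the socket so slurmrestd recreates it with the right owner
rm /var/lib/slurmrestd.socket
# Option 2: keep the socket and fix its ownership instead
# chown slurmrestd: /var/lib/slurmrestd.socket
systemctl start slurmrestd
```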

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Brian Andrus
Seems like the time may have been off on the db server at the insert/update. You may want to dump the database, find what tables/records need to be updated, and try updating them. If anything goes south, you can restore from the dump. Brian Andrus On 12/20/2022 11:51 AM, Reed Dier wrote: Just to

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Brian Andrus
Try:     sacctmgr list runawayjobs Brian Andrus On 12/20/2022 7:54 AM, Reed Dier wrote: Hoping this is a fairly simple one. This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the

Re: [slurm-users] I can't seem to use all the CPUs in my Cluster?

2022-12-13 Thread Brian Andrus
the many articles, wikis and videos out there. TLDR; If you are going to be running efficient HPC jobs, you are indeed better off with HT turned off. Brian Andrus On 12/13/2022 8:03 AM, Gary Mansell wrote: Hi, thanks for getting back to me. I have been doing some more experimenting, and I

Re: [slurm-users] I can't seem to use all the CPUs in my Cluster?

2022-12-13 Thread Brian Andrus
assigned to it. Also check the state of the nodes with 'sinfo' It would also be good to ensure the node settings are right. Run 'slurmd -C' on a node and see if the output matches what is in the config. Brian Andrus On 12/13/2022 1:38 AM, Gary Mansell wrote: Dear Slurm Us
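To compare detected hardware with the configuration, as suggested above (the output line is illustrative):

```
# On the compute node: print the node definition as slurmd detects it
slurmd -C
# NodeName=node01 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 \
#   ThreadsPerCore=2 RealMemory=257000
```

Any mismatch with the corresponding NodeName line in slurm.conf (CPUs, sockets/cores/threads, RealMemory) can leave CPUs unallocatable or cause the node to drain.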

Re: [slurm-users] Job allocation from a heterogenous pool of nodes

2022-12-07 Thread Brian Andrus
You may want to look here: https://slurm.schedmd.com/heterogeneous_jobs.html Brian Andrus On 12/7/2022 12:42 AM, Le, Viet Duc wrote: Dear slurm community, I am encountering a unique situation where I need to allocate jobs to nodes with different numbers of CPU cores. For instance
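The heterogeneous-jobs page linked above lets one submission carry components with different resource shapes; a minimal sbatch sketch (component sizes and program names are made up for illustration):

```
#!/bin/bash
#SBATCH --job-name=het-demo
# Component 0: one fat node
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=32
#SBATCH hetjob
# Component 1: two thin nodes
#SBATCH --nodes=2 --ntasks=2 --cpus-per-task=8
srun --het-group=0 ./solver &
srun --het-group=1 ./workers &
wait
```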

Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread Brian Andrus
I successfully built it for Rocky straight from the tgz file, as usual, with rpmbuild -ta Brian Andrus On 12/2/2022 9:21 AM, David Thompson wrote: Hi folks, I'm working on getting Slurm v22 RPMs built for our Alma 8 Slurm cluster. We would like to be able to use the sbatch --prefer option
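For anyone following along, the build boils down to something like this (the version number is illustrative, and the exact -devel package list depends on which plugins you want built):

```
# On an EL8 build host
dnf install -y rpm-build munge-devel pam-devel readline-devel mariadb-devel
rpmbuild -ta slurm-22.05.7.tar.bz2
# RPMs are written under ~/rpmbuild/RPMS/x86_64/
```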

Re: [slurm-users] Licenses: Remote vs Reservation

2022-11-30 Thread Brian Andrus
ed to submit at all? The reservation method can cause an sbatch command to be rejected, if that is what you are looking for. Brian Andrus On 11/30/2022 6:29 AM, Richard Ems wrote: Hi all, I have to change our set up to be able to update the total number of available licenses due to users che
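The two mechanisms contrasted in this thread can be sketched as follows (license name, server, and counts are hypothetical):

```
# Local licenses: fixed count in slurm.conf, requires a reconfigure to change
Licenses=matlab:10

# Remote license resource: tracked in slurmdbd, count can be updated live
sacctmgr add resource name=matlab server=flexlm count=20 \
    type=License percentallowed=100 cluster=mycluster
sacctmgr modify resource name=matlab server=flexlm set count=15
```

Per the reply above, it is the reservation/local form that can reject an sbatch outright at submit time.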

Re: [slurm-users] How to launch slurm services after installation

2022-11-27 Thread Brian Andrus
Steve, I suspect you did not install the packages. You need to install slurm-slurmctld to get the slurmctld systemd files: # rpm -qlp slurm-slurmctld-20.11.9-1.el7.x86_64.rpm /run/slurm/slurmctld.pid /usr/lib/systemd/system/slurmctld.service /usr/sbin/slurmctld

Re: [slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system

2022-11-23 Thread Brian Andrus
reset/recreate it. That addresses even a miffed software change. Brian Andrus On 11/23/2022 5:11 AM, Xaver Stiensmeier wrote: Hello slurm-users, The question can be found in a similar fashion here: https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a

Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Brian Andrus
processing data. There are many ways to do that, but those designs fall under MariaDB and not Slurm. Brian Andrus On 11/1/2022 6:49 PM, Richard Chang wrote: Does it mean it is best to use a single slurmdbd host in my case? My primary slurmctld is the backup slurmdbd host, and my worry is if t

Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Brian Andrus
Ole, Fair enough, it is actually slurmctld that does the caching. Technical typo on my part there. Just trying to let the user know, there is a window that they have to ensure no information is lost during a database outage. Brian Andrus On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote: Hi

Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-10-31 Thread Brian Andrus
It caches up to a point. As I understand it, that is about an hour (depending on size and how busy the cluster is, as well as available memory, etc). Brian Andrus On 10/31/2022 9:20 PM, Richard Chang wrote: Hi, Just for my info, I would like to know what happens when SlurmDBD loses

Re: [slurm-users] Ideal NFS exported StateSaveLocation size.

2022-10-24 Thread Brian Andrus
YMMV, but if you aren't having excessive traffic to the share, you should be good. I have yet to discover what would be excessive enough to impact things. The only use I have had for the HA is being able to keep the cluster running/happy during maintenance. Brian Andrus On 10/24/2022 1:
