[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo
On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
> For now I just set it to chmod 777 on /tmp and that fixed the errors. Is there a better option?

Traditionally /tmp and /var/tmp have been 1777 (that "1" being the sticky bit, originally invented to indicate that the OS should attempt to keep a frequently used binary in memory but then adopted to indicate special handling of a world writeable directory so users can only unlink objects they own and not others).

Hope that helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
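A minimal example of restoring those traditional permissions, assuming the standard /tmp and /var/tmp locations:

    chmod 1777 /tmp /var/tmp
    ls -ld /tmp /var/tmp    # both should now show drwxrwxrwt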
[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?
On 26/2/24 12:27 am, Josef Dvoracek via slurm-users wrote:
> What is the recommended way to run longer interactive job at your systems?

We provide NX for our users and also access via JupyterHub.

We also have high priority QOS's intended for interactive use for rapid response, but they are capped at 4 hours (or 6 hours for Jupyter users).

All the best,
Chris
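As a point of comparison, a rough sketch of what requesting an interactive shell under such a QOS might look like (the QOS name, time limit and resources here are purely illustrative):

    salloc --qos=interactive --time=04:00:00 --nodes=1 --cpus-per-task=4 --mem=8G
    # once the allocation is granted, run commands or srun job steps inside it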
[slurm-users] Re: REST API - get_user_environment
On 15/8/24 10:55 am, jpuerto--- via slurm-users wrote:
> Any ideas on whether there's a way to mirror this functionality in v0.0.40?

Sorry for not seeing this sooner - I'm afraid I don't!

All the best,
Chris
[slurm-users] Re: REST API - get_user_environment
On 22/8/24 11:18 am, jpuerto--- via slurm-users wrote:
> Do you have a link to that code? Haven't had any luck finding that repo

It's here (on the 23.11 branch):

https://github.com/SchedMD/slurm/tree/slurm-23.11/src/slurmrestd/plugins/openapi/dbv0.0.38
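If it's easier to poke at locally, something along these lines should fetch just that branch and show the plugin directory from the URL above:

    git clone --branch slurm-23.11 --depth 1 https://github.com/SchedMD/slurm.git
    ls slurm/src/slurmrestd/plugins/openapi/dbv0.0.38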
[slurm-users] Re: REST API - get_user_environment
On 27/8/24 10:26 am, jpuerto--- via slurm-users wrote:
> Is anyone in contact with the development team?

Folks with a support contract can submit bugs at https://support.schedmd.com/

> I feel that this is pretty basic functionality that was removed from the REST API without warning. Considering that this was a "patch" release (based on traditional semantic versioning guidelines), this type of modification shouldn't have happened and makes me worry about upgrading in the future.

Slurm hasn't used semantic versioning for a long time; it moved to a year.month.minor version scheme instead. Major releases now come every 6 months, so the most recent ones have been:

* 23.02.0
* 23.11.0 (old 9 month cycle)
* 24.05.0 (new 6 month cycle)

The next major release should be in November:

* 24.11.0

All the best,
Chris
[slurm-users] Re: Spread a multistep job across clusters
On 26/8/24 8:40 am, Di Bernardini, Fabio via slurm-users wrote:
> Hi everyone, for accounting reasons, I need to create only one job across two or more federated clusters with two or more srun steps.

The limitations for heterogeneous jobs say:

https://slurm.schedmd.com/heterogeneous_jobs.html#limitations

> In a federation of clusters, a heterogeneous job will execute
> entirely on the cluster from which the job is submitted. The
> heterogeneous job will not be eligible to migrate between clusters
> or to have different components of the job execute on different
> clusters in the federation.

However, from your script it's not clear to me that's what you mean, because you include multiple --cluster options. I'm not sure if that works; as you mention, the docs don't cover that case. They do say (however) that:

> If a heterogeneous job is submitted to run in multiple clusters not
> part of a federation (e.g. "sbatch --cluster=alpha,beta ...") then
> the entire job will be sent to the cluster expected to be able to
> start all components at the earliest time.

My gut instinct is that this isn't going to work; my feeling is that launching a heterogeneous job like this would require the slurmctld's on each cluster to coordinate, and I'm not aware of that being possible currently.

All the best,
Chris
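For reference, a rough sketch of a two-component heterogeneous batch job of the kind those docs describe (cluster names, resource counts and the ./coordinator and ./workers programs are placeholders, and whether this combination actually works across clusters is exactly the open question above):

    #!/bin/bash
    #SBATCH --clusters=alpha,beta
    #SBATCH --ntasks=1 --mem=4G
    #SBATCH hetjob
    #SBATCH --ntasks=8 --mem=16G

    # the batch script runs with the first component; each srun targets one component
    srun --het-group=0 ./coordinator &
    srun --het-group=1 ./workers &
    wait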
[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld
On 2/2/25 2:46 pm, Steven Jones via slurm-users wrote:
> [2025-01-30T19:45:29.024] error: Security violation, ping RPC from uid 12002

Looking at the source, that seems to come from this code:

	if (!_slurm_authorized_user(msg->auth_uid)) {
		error("Security violation, batch launch RPC from uid %u",
		      msg->auth_uid);
		rc = ESLURM_USER_ID_MISSING;	/* or bad in this case */
		goto done;
	}

and what it is calling is:

	/*
	 * Returns true if "uid" is a "slurm authorized user" - i.e. uid == 0
	 * or uid == slurm user id at this time.
	 */
	static bool _slurm_authorized_user(uid_t uid)
	{
		return ((uid == (uid_t) 0) || (uid == slurm_conf.slurm_user_id));
	}

Is it possible you're trying to run Slurm as a user other than root or the user designated as the "SlurmUser" in your config?

Also check that whoever you have set as the SlurmUser has the same UID everywhere (in fact every user should).

All the best,
Chris
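A quick way to sanity-check that UID consistency, assuming the SlurmUser is called "slurm" and that you can ssh to the nodes (adjust the host names to suit):

    for h in controller node1 node2 node3 node4; do
        ssh "$h" id slurm
    done
    # the uid= value should be identical on every host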
[slurm-users] Re: RHEL8.10 V slurmctld
On 29/1/25 10:44 am, Steven Jones via slurm-users wrote:
> [2025-01-28T21:48:50.271] sched: Allocate JobId=4 NodeList=node4 #CPUs=1 Partition=debug
> [2025-01-28T21:48:50.280] Killing non-startable batch JobId=4: Invalid user id

Looking at the source code, it looks like that second error is reported back to slurmctld when it sends the RPC out to the compute node and gets a response back, so I would look at what's going on with node4 to see what's being reported there.

All the best,
Chris
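For reference, the slurmd log on that node is usually where the rejected RPC gets explained; the path below is just a common default, so check the SlurmdLogFile setting in your slurm.conf if it lives elsewhere:

    ssh node4 'grep -iE "invalid user|error" /var/log/slurm/slurmd.log | tail -n 20'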
[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld
On 2/2/25 1:54 pm, Steven Jones via slurm-users wrote:
> Thanks for the reply. I already went through this 🙁. I checked all nodes, id works as does a ssh login.

What is in your slurmd logs on that node?
[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld
On 2/2/25 3:46 pm, Steven Jones wrote:
> I have never done a HPC before, it is all new to me so I can be making "newbie errors". The old HPC has been dumped on us so I am trying to build it "professionally" shall we say ie documented, stable and I will train ppl to build it (all this with no money at all).

No worries at all!

It would be good to know what this says:

scontrol show config | fgrep -i slurmuser

If that doesn't say "root", what does the "id" command say for that user on both the system where slurmctld is running and on node4?

Also, on the node where slurmctld is running, what does this say?

ps auxwww | fgrep slurmctld

Best of luck!

Chris

(you can tell I'm stranded at SFO until tonight due to American Airlines pulling the plane for my morning flight out of service. Still, I'd rather that than be another news headline)
[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld
On 2/2/25 4:18 pm, Steven Jones via slurm-users wrote:
> isn't it slurmd on the compute nodes?

It is, but as this check is (I think) happening on the compute node I was wanting to check who slurmctld was running as.

The only other thought I have is: what is set as the SlurmUser in the compute nodes' slurm.conf? I wonder if that's set to root? If so it wouldn't know that the "slurm" user was authorised. Usually those are in step though.

Everything else you've shown seems to be in order.

All the best,
Chris
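A quick way to compare the two copies of the config being discussed, assuming the usual /etc/slurm/slurm.conf location on both machines:

    grep -i '^SlurmUser' /etc/slurm/slurm.conf                # on the slurmctld host
    ssh node4 "grep -i '^SlurmUser' /etc/slurm/slurm.conf"    # on the compute node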
[slurm-users] Re: node3 not working - down
On 9/12/24 5:44 pm, Steven Jones via slurm-users wrote:
> [2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
> [2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
> [2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56 2024
> [2024-12-09T23:38:56.645] error: Check for out of sync clocks

One system is 24 hours behind/ahead of the other. You should make sure NTP is set up and working on all these nodes.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
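A quick way to compare clocks and confirm time synchronisation (the chronyc command assumes chrony is the NTP client in use; ntpd or systemd-timesyncd have their own equivalents):

    date -u               # run on each node and compare the output
    timedatectl status    # shows whether NTP is enabled and the clock synchronised
    chronyc tracking      # current offset from the configured time sources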
[slurm-users] Re: Run a command in Slurm with all streams and signals connected to the submitting command
On 4/4/25 5:23 am, Michael Milton via slurm-users wrote:
> Plain srun re-uses the existing Slurm allocation, and specifying resources like --mem will just request them from the current job rather than submitting a new one

srun does that as it sees all the various SLURM_* environment variables in the environment of the running job. My bet would be that if you eliminated them from the environment of the srun then you would get a new allocation.

I've done similar things in the past to do an sbatch for a job that wants to run on very different hardware with:

env $(env | awk -F= '/^(SLURM|SBATCH)/ {print "-u",$1}' | paste -s -d\ ) sbatch [...]

So it could be worth substituting srun for sbatch there and seeing if that helps.

Best of luck!

Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
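For what it's worth, the substitution being suggested would look something like this (the trailing [...] stands for whatever srun options and command you were already using):

    env $(env | awk -F= '/^(SLURM|SBATCH)/ {print "-u",$1}' | paste -s -d\ ) srun [...]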
[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm
Hiya!

On 16/4/25 12:56 am, lyz--- via slurm-users wrote:
> I've tried version 23.11.10. It does work.

Oh that's wonderful, so glad it helped!

It did seem quite odd that it wasn't working for you before then. I wonder if this was a cgroups v1 vs cgroups v2 thing?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
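If you ever want to check that theory, one quick way to tell which cgroup version a node is running:

    stat -fc %T /sys/fs/cgroup/
    # "cgroup2fs" indicates cgroup v2 (unified hierarchy); "tmpfs" indicates cgroup v1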
[slurm-users] Re: Please help - Building Slurm-24.11.1 Failed
On 22/2/25 9:04 pm, Zhang, Yuan via slurm-users wrote:
> I got errors about missing perl modules when building slurm 24.11.1 rpm packages. Has anyone seen this error before? And how to fix it?

If my memory serves me right I would see those same errors when building Slurm for Cray XC in a chroot into an OS image that it was needed for. The weird thing was it would only happen the very first time it was built in that chroot; every time after that (in the same OS image) it would work.

Never did get to the bottom of what the cause was, and those systems are gone now. Why Perl specifically I have no idea, it's not like it changes all the time!

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: Please help - Building Slurm-24.11.1 Failed
On 23/2/25 9:49 am, Zhang, Yuan via slurm-users wrote:
> Thanks for your input. The error I see may not be the same as what you had on the Cray system, but it shed some light on the troubleshooting direction.

My pleasure, I'm so glad that helped point the way! Best of luck on your endeavours.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA