[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Chris Samuel via slurm-users

On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:

For now I just set it to chmod 777 on /tmp and that fixed the errors. Is 
there a better option?


Traditionally /tmp and /var/tmp have been mode 1777 (the "1" being the 
sticky bit, originally invented to tell the OS it should try to keep a 
frequently used binary in memory, but later adopted to mark special 
handling of a world-writable directory so that users can only unlink 
objects they own, not anyone else's).
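
If you'd rather put it back to the conventional mode than leave it at a 
plain 777, something like this should do it (a sketch, and the same 
applies to /var/tmp):

# restore the traditional world-writable-plus-sticky-bit mode
chmod 1777 /tmp /var/tmp
# verify: both should now show drwxrwxrwt
stat -c '%A %n' /tmp /var/tmp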


Hope that helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-27 Thread Chris Samuel via slurm-users

On 26/2/24 12:27 am, Josef Dvoracek via slurm-users wrote:


What is the recommended way to run longer interactive job at your systems?


We provide NX for our users and also access via JupyterHub.

We also have high-priority QOSes intended for interactive use where 
rapid response is needed, but they are capped at 4 hours (or 6 hours for 
Jupyter users).
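
Purely as an illustration (the QOS name, account, priority and limits 
below are made up, not our actual settings), a capped interactive QOS 
can be set up along these lines:

# create a high-priority QOS for interactive work, capped at 4 hours
sacctmgr add qos interactive Priority=1000 MaxWall=04:00:00
# let an account use it
sacctmgr modify account mygroup set qos+=interactive
# users then request it explicitly, e.g.:
salloc --qos=interactive --time=02:00:00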


All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: REST API - get_user_environment

2024-08-27 Thread Chris Samuel via slurm-users

On 15/8/24 10:55 am, jpuerto--- via slurm-users wrote:


Any ideas on whether there's a way to mirror this functionality in v0.0.40?


Sorry for not seeing this sooner; I'm afraid I don't!

All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: REST API - get_user_environment

2024-08-27 Thread Chris Samuel via slurm-users

On 22/8/24 11:18 am, jpuerto--- via slurm-users wrote:


Do you have a link to that code? Haven't had any luck finding that repo


It's here (on the 23.11 branch):

https://github.com/SchedMD/slurm/tree/slurm-23.11/src/slurmrestd/plugins/openapi/dbv0.0.38
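
If it's easier to poke around locally, a shallow clone of just that 
branch works too (a sketch):

# grab only the slurm-23.11 branch to keep the download small
git clone --branch slurm-23.11 --depth 1 https://github.com/SchedMD/slurm.git
ls slurm/src/slurmrestd/plugins/openapi/dbv0.0.38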

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: REST API - get_user_environment

2024-08-27 Thread Chris Samuel via slurm-users

On 27/8/24 10:26 am, jpuerto--- via slurm-users wrote:


Is anyone in contact with the development team?


Folks with a support contract can submit bugs at 
https://support.schedmd.com/



I feel that this is pretty basic functionality that was removed from the REST API without 
warning. Considering that this was a "patch" release (based on traditional 
semantic versioning guidelines), this type of modification shouldn't have happened and 
makes me worry about upgrading in the future.


Slurm hasn't used semantic versioning for a long time; it moved to a 
year.month.minor scheme years ago. Major releases now come every 6 
months, so the most recent ones have been:


* 23.02.0
* 23.11.0 (old 9 month system)
* 24.05.0 (new 6 month system)

Next major release should be in November:

* 24.11.0

All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Spread a multistep job across clusters

2024-08-27 Thread Chris Samuel via slurm-users

On 26/8/24 8:40 am, Di Bernardini, Fabio via slurm-users wrote:

Hi everyone, for accounting reasons, I need to create only one job 
across two or more federated clusters with two or more srun steps.


The limitations for heterogeneous jobs say:

https://slurm.schedmd.com/heterogeneous_jobs.html#limitations

> In a federation of clusters, a heterogeneous job will execute
> entirely on the cluster from which the job is submitted. The
> heterogeneous job will not be eligible to migrate between clusters
> or to have different components of the job execute on different
> clusters in the federation.

However, from your script it's not clear to me that that's what you 
mean, because you include multiple --cluster options. I'm not sure 
whether that works since, as you mention, the docs don't cover that 
case. They do say, however, that:


> If a heterogeneous job is submitted to run in multiple clusters not
> part of a federation (e.g. "sbatch --cluster=alpha,beta ...") then
> the entire job will be sent to the cluster expected to be able to
> start all components at the earliest time.

My gut instinct is that this isn't going to work. My feeling is that 
launching a heterogeneous job like this would require the slurmctlds on 
each cluster to coordinate, and I'm not aware of that being possible 
currently.


All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-02 Thread Chris Samuel via slurm-users

On 2/2/25 2:46 pm, Steven Jones via slurm-users wrote:


[2025-01-30T19:45:29.024] error: Security violation, ping RPC from uid 12002


Looking at the source, that message seems to come from this code:

if (!_slurm_authorized_user(msg->auth_uid)) {
        error("Security violation, batch launch RPC from uid %u",
              msg->auth_uid);
        rc = ESLURM_USER_ID_MISSING;    /* or bad in this case */
        goto done;
}


and what it is calling is:

/*
 *  Returns true if "uid" is a "slurm authorized user" - i.e. uid == 0
 *   or uid == slurm user id at this time.
 */
static bool
_slurm_authorized_user(uid_t uid)
{
        return ((uid == (uid_t) 0) || (uid == slurm_conf.slurm_user_id));
}


Is it possible you're trying to run Slurm as a user other than root or 
the user designated as the "SlurmUser" in your config?


Also check that whoever you have set as the SlurmUser has the same UID 
everywhere (in fact every user should have the same UID on every node).
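
A quick way to check both of those (this assumes your SlurmUser is 
called "slurm" and that you have ssh access to a compute node) is:

# on the slurmctld host: who does Slurm think its user is?
scontrol show config | fgrep -i SlurmUser
# the UID must be identical on the controller and the compute node
id slurm
ssh <compute-node> id slurm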


All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: RHEL8.10 V slurmctld

2025-02-02 Thread Chris Samuel via slurm-users

On 29/1/25 10:44 am, Steven Jones via slurm-users wrote:

"2025-01-28T21:48:50.271] sched: Allocate JobId=4 NodeList=node4 #CPUs=1 
Partition=debug
[2025-01-28T21:48:50.280] Killing non-startable batch JobId=4: Invalid 
user id"


Looking at the source code it looks like that second error is reported 
by slurmctld after it sends the RPC out to the compute node and gets a 
response back, so I would look at what's going on with node4 to see 
what's being reported there.
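
A sketch of the sort of thing I'd check on node4 (the log path and user 
name here are placeholders; use whatever SlurmdLogFile and the job owner 
actually are):

# where does slurmd log on this node?
scontrol show config | fgrep -i SlurmdLogFile
# what did it report around the time of the launch?
grep -Ei 'error|invalid user' /var/log/slurm/slurmd.log
# does the job owner's UID here match the one on the controller?
id <username>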


All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-02 Thread Chris Samuel via slurm-users

On 2/2/25 1:54 pm, Steven Jones via slurm-users wrote:

Thanks for the reply.  I already went through this 🙁.  I checked all 
nodes, id works as does a ssh login.


What is in your slurmd logs on that node?

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-02 Thread Chris Samuel via slurm-users

On 2/2/25 3:46 pm, Steven Jones wrote:

I have never done a HPC before, it is all new to me so I can be making 
"newbie errors".   The old HPC has been dumped on us so I am trying to 
build it "professionally" shall we say  ie documented, stable and I will 
train ppl to build it  (all this with no money at all).


No worries at all! It would be good to know what this says:

scontrol show config | fgrep -i slurmuser

If that doesn't say "root" what does the "id" command say for that user 
on both the system where slurmctld is running and on node4?


Also on the node where slurmctld is running what does this say?

ps auxwww | fgrep slurmctld

Best of luck!
Chris

(You can tell I'm stranded at SFO until tonight due to American Airlines 
pulling the plane for my morning flight out of service. Still, I'd 
rather that than be another news headline.)


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-02 Thread Chris Samuel via slurm-users

On 2/2/25 4:18 pm, Steven Jones via slurm-users wrote:


isn't it slurmd on the compute nodes?


It is, but as this check is (I think) happening on the compute node, I 
wanted to check who slurmctld was running as.

The only other thought I have is: what is set as the SlurmUser in the 
compute nodes' slurm.conf? I wonder if that's set to root? If so it 
wouldn't know that the "slurm" user was authorised.


Usually those are in step though. Everything else you've shown seems to 
be in order.
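
One quick way to check that (assuming the standard /etc/slurm/slurm.conf 
path and ssh access to node4) would be:

# compare the SlurmUser setting on the controller and on node4
grep -i '^SlurmUser' /etc/slurm/slurm.conf
ssh node4 grep -i '^SlurmUser' /etc/slurm/slurm.conf
# or just confirm the whole file is identical on both
md5sum /etc/slurm/slurm.conf && ssh node4 md5sum /etc/slurm/slurm.conf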


All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: node3 not working - down

2024-12-09 Thread Chris Samuel via slurm-users

On 9/12/24 5:44 pm, Steven Jones via slurm-users wrote:


[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 
23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 
23:38:56 2024

[2024-12-09T23:38:56.645] error: Check for out of sync clocks


One system is 24 hours behind/ahead of the other.

You should make sure NTP is set up and working on all these nodes.
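
A quick check on each node (this assumes systemd and chrony; use 
"ntpq -p" instead if you're running ntpd):

# is the clock actually synchronised, and to what?
timedatectl status | grep -i 'synchronized'
chronyc tracking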

--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Run a command in Slurm with all streams and signals connected to the submitting command

2025-04-04 Thread Chris Samuel via slurm-users

On 4/4/25 5:23 am, Michael Milton via slurm-users wrote:

Plain srun re-uses the existing Slurm allocation, and specifying 
resources like --mem will just request then from the current job rather 
than submitting a new one


srun does that as it sees all the various SLURM_* environment variables 
in the environment of the running job. My bet would be that if you 
eliminated them from the environment of the srun then you would get a 
new allocation.


I've done similar things in the past to do an sbatch for a job that 
wants to run on very different hardware with:


env $(env | awk -F= '/^(SLURM|SBATCH)/ {print "-u",$1}' | paste -s -d\ ) sbatch [...]


So it could be worth substituting srun for sbatch there and see if that 
helps.
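
The srun form would then be something along these lines (an untested 
sketch, same idea of unsetting the inherited variables before launching):

env $(env | awk -F= '/^(SLURM|SBATCH)/ {print "-u",$1}' | paste -s -d\ ) srun [...]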


Best of luck!
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm

2025-04-16 Thread Chris Samuel via slurm-users

Hiya!

On 16/4/25 12:56 am, lyz--- via slurm-users wrote:


I've tried version 23.11.10. It does work.


Oh that's wonderful, so glad it helped! It did seem quite odd that it 
wasn't working for you before then. I wonder if this was a cgroups v1 vs 
cgroups v2 thing?
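
If you want to confirm which one a node is running, this is a quick way 
to tell (cgroup2fs means the unified v2 hierarchy, tmpfs means v1):

# report the filesystem type mounted at the cgroup root
stat -fc %T /sys/fs/cgroup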


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Please help - Building Slurm-24.11.1 Failed

2025-02-23 Thread Chris Samuel via slurm-users

On 22/2/25 9:04 pm, Zhang, Yuan via slurm-users wrote:

I got errors about missing perl modules when building slurm24.11.1 rpm 
packages.  Has anyone seen this error before? And how to fix it?


If my memory serves me right, I would see those same errors when 
building Slurm for the Cray XC in a chroot into the OS image it was 
needed for.


The weird thing was it would only happen the very first time it was 
built in that chroot; every time after that (in the same OS image) it 
would work. I never did get to the bottom of what the cause was, and 
those systems are gone now.


Why Perl specifically I have no idea, it's not like it changes all the time!
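
One thing that might be worth trying (just a guess on my part, and 
assuming a RHEL-family build host, so package names may differ) is 
making sure the Perl build tooling is installed before kicking off the 
rebuild:

# install the Perl toolchain the spec file's perl bindings want
dnf install -y perl perl-ExtUtils-MakeMaker
# then rebuild straight from the release tarball
rpmbuild -ta slurm-24.11.1.tar.bz2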

--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Please help - Building Slurm-24.11.1 Failed

2025-02-23 Thread Chris Samuel via slurm-users

On 23/2/25 9:49 am, Zhang, Yuan via slurm-users wrote:

Thanks for your input. The error I see may not be the same as what you 
had on the Cray system, but it shed some lights on the troubleshooting 
direction.


My pleasure, I'm so glad that helped point the way!

Best of luck on your endeavours.

--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com