Before you get all excited about it, we had a terrible time trying to get
GPU metrics. We finally abandoned it and switched to Grafana, Prometheus, and InfluxDB.
Good luck to you though.
From: slurm-users on behalf of "Heckes,
Frank"
Reply-To: Slurm User Community List
Date: Wednesday, April 14,
Also check the NodeAddr settings in your slurm.conf.
On 3/22/21, 2:48 PM, "slurm-users on behalf of Michael Robbert"
wrote:
I haven't tried a configless setup yet, but the problem you're hitting looks
like it could be a DNS issue. Can you do a DNS lookup of n26 from the login
node? The w
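A quick way to check both suggestions in this thread (a rough sketch; the node
name n26 comes from the message above, the address shown is made up):

    # From the login node: does the compute node's name resolve?
    host n26
    getent hosts n26

    # What name/address does Slurm itself have on record for the node?
    scontrol show node n26 | grep -E 'NodeAddr|NodeHostName'

    # slurm.conf entry that pins the address explicitly (hypothetical address)
    NodeName=n26 NodeAddr=10.1.0.26 State=UNKNOWN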
Maybe it's Friday, but I cannot for the life of me figure out how to update a
user's partitions.
Just trying to give a user access to another partition.
sacctmgr modify user where name=foo set partition=par1,par2,par3
Use keyword 'where' to modify condition
Tried pretty much all the permutations.
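For what it's worth, partition access in the accounting database is stored per
association, so the usual route is to add an association for each extra
partition rather than modifying the user record in place. A rough sketch,
assuming a hypothetical account name 'research' (the user name foo is from the
command above):

    # One association per (user, account, partition) combination
    sacctmgr add user foo account=research partition=par2
    sacctmgr add user foo account=research partition=par3

    # Check what ended up in the database
    sacctmgr show assoc user=foo format=user,account,partition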
Hello,
Not sure what your setup is, but check the compute nodes' route tables. You
might also need to turn on IPv4 forwarding on whatever is their default
gateway. Firewalls can come into play too. This isn't a Slurm issue, pretty sure!
Matt
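A minimal sketch of the checks being suggested, assuming Linux compute nodes
and a gateway you control (nothing here is site-specific):

    # On a compute node: inspect the route table
    ip route show

    # On the default gateway: is IPv4 forwarding on?
    sysctl net.ipv4.ip_forward
    sysctl -w net.ipv4.ip_forward=1    # enable it (runtime only)

    # Quick look at firewall rules on the gateway or compute node
    iptables -L -n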
On 8/2/20, 7:53 AM, "slurm-users on behalf of Mahmo
Hello,
Running Slurm 17.02.6 on a Cray system, and all of a sudden we have been
receiving these error messages from slurmstepd. Not sure what triggers this?
srun -N 4 -n 4 hostname
nid00031
slurmstepd: error: task/cgroup: unable to add task[pid=903] to memory cg
'(null)'
nid00029
nid00030
slurm
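The '(null)' cgroup path suggests the memory cgroup hierarchy was never set up
for the job, rather than anything about the job itself; a hedged sketch of the
settings worth double-checking (standard Slurm parameters, example values):

    # slurm.conf
    TaskPlugin=task/cgroup
    ProctrackType=proctrack/cgroup

    # cgroup.conf
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes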
https://slurm.schedmd.com/resource_limits.
____
From: slurm-users on behalf of
Matthew BETTINGER
Sent: Tuesday, July 7, 2020 9:40 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Allow certain users to run over partition limit
Hello,
We have a slurm sy
Hello,
We have a Slurm system with partitions set for a max runtime of 24 hours. What
would be the proper way to allow a certain set of users to run jobs on the
current partitions beyond the partition limits? In the past we would isolate
some nodes based on their job requirements, make a new pa
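One common pattern for this (a hedged sketch; the QOS name, user, and limits
are hypothetical) is a QOS with a longer wall-time limit, granted only to the
users who need it. The PartitionTimeLimit flag is what lets jobs in that QOS
exceed the partition's time limit, and AccountingStorageEnforce must include
qos for any of it to be enforced:

    # Create a QOS allowing up to 7 days that can override partition time limits
    sacctmgr add qos longrun
    sacctmgr modify qos where name=longrun set MaxWall=7-00:00:00 Flags=PartitionTimeLimit

    # Grant it to the users who need it
    sacctmgr modify user where name=foo set qos+=longrun

    # Those users then request it explicitly
    sbatch --qos=longrun --time=48:00:00 job.sh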
Just curious if this option or the OOM setting (which we use) can leave nodes
in CG ("completing") state. We hit CG states quite often and the only way out
is to reboot the node. I believe it occurs when the parent process dies, gets
killed, or goes zombie? Thanks.
MB
On 10/8/19, 6:11 AM, "slurm-users on behal
works best for you.
HTH
--Dani_L.
On 7/23/19 7:36 PM, Matthew BETTINGER wrote:
Hello,
We run LSF and Slurm here. For LSF we have a weekend queue with no limit
and jobs get killed after Sunday. What is the best way to do something similar
for
Hello,
We run LSF and Slurm here. For LSF we have a weekend queue with no limit and
jobs get killed after Sunday. What is the best way to do something similar for
Slurm? A reservation? We would also like any running jobs killed after
Sunday, if possible. Thanks.
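One possible shape for this (a hedged sketch, not the only way; the partition
name, node list, and times are all made up): a dedicated partition with no
wall-time limit, plus a cron job that clears out whatever is left in it on
Sunday night. A weekly maintenance reservation can keep new jobs from starting
into Monday, but it won't kill jobs that are already running, hence the scancel:

    # slurm.conf: weekend partition with no time limit
    PartitionName=weekend Nodes=node[001-064] MaxTime=UNLIMITED State=UP

    # root crontab: cancel everything in the weekend partition Sunday at 23:59
    59 23 * * 0  scancel --partition=weekend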
se the epilog script, you can set the epilog script to
clean up all residues from the finished jobs:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts
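A minimal sketch of that idea (the script path and the cleanup target are
hypothetical; the linked page has a fuller example):

    # slurm.conf
    Epilog=/etc/slurm/epilog.sh

    #!/bin/bash
    # /etc/slurm/epilog.sh -- runs on each allocated node when a job finishes
    # Remove the job's scratch directory, if one was created
    rm -rf "/scratch/${SLURM_JOB_ID}"
    exit 0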
Ahmet M.
On 28.05.2019 at 19:03, Matthew BETTINGER wrote:
> We
We use triggers for the obvious alerts, but is there a way to make a trigger
for nodes stuck in CG (completing) state? Some user jobs, mostly Julia
notebooks, can get hung in completing state if the user kills the running job
or cancels it with Ctrl-C. When this happens we can have many, many nodes
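As far as I know strigger has no event for the completing state itself, so one
workaround (a hedged sketch; the recipient address is a placeholder) is a
periodic check for jobs sitting in CG:

    # Show anything currently in completing state: job id, node list, elapsed time
    squeue --states=COMPLETING --format="%i %N %M"

    # Cron-style check that mails an alert when something is stuck
    if [ -n "$(squeue --noheader --states=COMPLETING)" ]; then
        squeue --states=COMPLETING | mail -s "Jobs stuck in CG state" admin@example.com
    fi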
QOS works fine.
I just do not know enough about this and how to test again without causing
disruption. We're an inch high and a mile wide over here, with 3-4 different schedulers.
On 3/5/19, 11:29 AM, "slurm-users on behalf of Christopher Samuel"
wrote:
On 3/5/19 7:37 AM, Matthew BETTINGER wrote:
ould be different if that were the case.
Look at that, maybe send the QOS and partition config.
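For reference, the usual way to dump both is just the standard query commands
(nothing site-specific assumed):

    scontrol show partition
    sacctmgr show qos format=name,priority,maxwall,flags
    sacctmgr show assoc format=account,user,partition,qos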
- Michael
On Tue, Mar 5, 2019 at 7:40 AM Matthew BETTINGER
wrote:
Hey Slurm gurus. We have been trying to enable Slurm
Hey Slurm gurus. We have been trying to enable Slurm QOS on a Cray system here,
off and on, for quite a while, but can never get it working. Every time we try
to enable QOS we disrupt the cluster and users and have to fall back. I'm not
sure what we are doing wrong. We run a pretty open system
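For context, the minimal pieces usually involved in turning QOS enforcement on
look roughly like this (a hedged sketch of standard options, not a diagnosis of
the Cray-specific failure; user and QOS names are hypothetical). A common
gotcha is switching on enforcement before every user has a valid association
and a default QOS, at which point new submissions start getting rejected:

    # slurm.conf: enforce associations and QOS limits
    AccountingStorageEnforce=associations,limits,qos

    # Make sure each user can use, and defaults to, some QOS first
    sacctmgr modify user where name=foo set qos+=normal defaultqos=normal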
We stuck Avere between Isilon and a cluster to get us over the hump until the
next budget cycle ... then we replaced it with Spectrum Scale for mid-level
storage. We still use Lustre, of course, as scratch.
On 2/22/19, 12:24 PM, "slurm-users on behalf of Will Dennis"
wrote:
(replies inline)
One of the main guys, Panos, left Bright, so no answer to your specific
question, but I hope you can get some support with it. We dumped our BC PoC;
the sysadmin who worked on the PoC still has nightmares.
On 2/13/19, 6:54 AM, "slurm-users on behalf of John Hearns"
wrote:
Yugendra, the Brigh
Not sure if this is related, but we ran into an issue configuring accounting
because our cluster name had a '-' in it. That is an illegal character
for table names in MariaDB, or at least it used to be.
On 1/17/19, 11:07 AM, "slurm-users on behalf of Sajesh Singh"
wrote:
Trying to setup acco
Hello,
We are trying to find a way to gather information about jobs assigned to GPUs.
I'm not really finding anything we want using sreport for some reason. We
would like to find CPU hours for our GPUs. They are defined in GRES and need
to be passed to srun when users request GPUs. It lo
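If the GPU gres is tracked as a TRES, both sreport and sacct can report on it;
a hedged sketch (the date range is a placeholder, and AccountingStorageTRES has
to include gres/gpu before the data starts being collected):

    # slurm.conf: track the GPU gres in accounting
    AccountingStorageTRES=gres/gpu

    # GPU-hours per account and user for a month
    sreport -t hours -T gres/gpu cluster AccountUtilizationByUser start=2019-01-01 end=2019-02-01

    # Per-job view of allocated TRES and elapsed time
    sacct -X --format=JobID,User,Elapsed,AllocTRES%60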
MaxMemPerNode=UNLIMITED
I can run jobs in there, but if I restrict it to just one user (myself) then
the job does not run. I may have to leave this partition like this until I can
figure out the correct way, since we need this to run today.
On 1/3/19, 8:41 AM, "Matthew BETTINGER"
wrote:
Hello,
We are running Slurm 17.02.6 with accounting on a Cray CLE system.
We currently have a 24-hour runtime limit on our partitions, and a user needs
to run a job which will exceed 24 hours of runtime. I tried to make a
reservation, as seen below, allocating the user 36 hours to run his job, but it
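For reference, a user-scoped reservation along those lines usually looks
roughly like this (a hedged sketch; times, node count, and names are made up),
and the job then has to be submitted with --reservation. Note that a
reservation by itself does not lift the partition's MaxTime, so the time limit
usually still has to be handled separately (for example via a QOS, as discussed
elsewhere on this list):

    scontrol create reservation reservationname=long_foo users=foo \
        starttime=2019-01-03T18:00:00 duration=36:00:00 nodecnt=4 flags=ignore_jobs

    # The user submits into it
    sbatch --reservation=long_foo --time=36:00:00 job.sh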