Before you get all excited about it, we had a terrible time trying to get
GPU metrics. We finally abandoned it and switched to Grafana, Prometheus, and InfluxDB.
Good luck to you though.
From: slurm-users on behalf of "Heckes,
Frank"
Reply-To: Slurm User Community List
Date: Wednesday, April 14,
Also check the NodeAddr settings in your slurm.conf.
On 3/22/21, 2:48 PM, "slurm-users on behalf of Michael Robbert"
wrote:
I haven't tried a configless setup yet, but the problem you're hitting looks
like it could be a DNS issue. Can you do a DNS lookup of n26 from the login
node? The w
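A quick way to check both suggestions in this thread (a rough sketch; the node
name n26 comes from the message above, the address shown is made up):

    # From the login node: does the compute node's name resolve?
    host n26
    getent hosts n26

    # What name/address does Slurm itself have on record for the node?
    scontrol show node n26 | grep -E 'NodeAddr|NodeHostName'

    # slurm.conf entry that pins the address explicitly (hypothetical address)
    NodeName=n26 NodeAddr=10.1.0.26 State=UNKNOWN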
Maybe it's Friday, but I cannot for the life of me figure out how to update a
user's partitions.
Just trying to give a user access to another partition.
sacctmgr modify user where name=foo set partition=par1,par2,par3
Use keyword 'where' to modify condition
Tried pretty much all the permutations.
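For what it's worth, partition access in the accounting database is stored per
association, so the usual route is to add an association for each extra
partition rather than modifying the user record in place. A rough sketch,
assuming a hypothetical account name 'research' (the user name foo is from the
command above):

    # One association per (user, account, partition) combination
    sacctmgr add user foo account=research partition=par2
    sacctmgr add user foo account=research partition=par3

    # Check what ended up in the database
    sacctmgr show assoc user=foo format=user,account,partition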
Hello,
Not sure what your setup is, but check the compute nodes' route tables. You
might also need to turn on IPv4 forwarding on whatever is their default
gateway. Firewalls can come into play too. This isn't a Slurm issue, pretty sure!
Matt
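A minimal sketch of the checks being suggested, assuming Linux compute nodes
and a gateway you control (nothing here is site-specific):

    # On a compute node: inspect the route table
    ip route show

    # On the default gateway: is IPv4 forwarding on?
    sysctl net.ipv4.ip_forward
    sysctl -w net.ipv4.ip_forward=1    # enable it (runtime only)

    # Quick look at firewall rules on the gateway or compute node
    iptables -L -n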
On 8/2/20, 7:53 AM, "slurm-users on behalf of Mahmo
Hello,
Running Slurm 17.02.6 on a Cray system, and all of a sudden we have been
receiving these error messages from slurmstepd. Not sure what triggers this?
srun -N 4 -n 4 hostname
nid00031
slurmstepd: error: task/cgroup: unable to add task[pid=903] to memory cg
'(null)'
nid00029
nid00030
slurm
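The '(null)' cgroup path suggests the memory cgroup hierarchy was never set up
for the job, rather than anything about the job itself; a hedged sketch of the
settings worth double-checking (standard Slurm parameters, example values):

    # slurm.conf
    TaskPlugin=task/cgroup
    ProctrackType=proctrack/cgroup

    # cgroup.conf
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes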
https://slurm.schedmd.com/resource_limits.
____
From: slurm-users on behalf of
Matthew BETTINGER
Sent: Tuesday, July 7, 2020 9:40 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Allow certain users to run over partition limit
Hello,
We have a slurm sy
Hello,
We have a Slurm system with partitions set for a max runtime of 24 hours. What
would be the proper way to allow a certain set of users to run jobs on the
current partitions beyond the partition limits? In the past we would isolate
some nodes based on their job requirements, make a new pa
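One common pattern for this (a hedged sketch; the QOS name, user, and limits
are hypothetical) is a QOS with a longer wall-time limit, granted only to the
users who need it. The PartitionTimeLimit flag is what lets jobs in that QOS
exceed the partition's time limit, and AccountingStorageEnforce must include
qos for any of it to be enforced:

    # Create a QOS allowing up to 7 days that can override partition time limits
    sacctmgr add qos longrun
    sacctmgr modify qos where name=longrun set MaxWall=7-00:00:00 Flags=PartitionTimeLimit

    # Grant it to the users who need it
    sacctmgr modify user where name=foo set qos+=longrun

    # Those users then request it explicitly
    sbatch --qos=longrun --time=48:00:00 job.sh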
Just curious if this option or the OOM setting (which we use) can leave nodes
in CG ("completing") state. We hit CG states quite often and the only way out
is to reboot the node. I believe it occurs when the parent process dies, gets
killed, or goes zombie? Thanks.
MB
On 10/8/19, 6:11 AM, "slurm-users on behal
works best for you.
HTH
--Dani_L.
On 7/23/19 7:36 PM, Matthew BETTINGER wrote:
Hello,
We run LSF and Slurm here. For LSF we have a weekend queue with no limit
and jobs get killed after Sunday. What is the best way to do something similar
for
Hello,
We run LSF and Slurm here. For LSF we have a weekend queue with no limit and
jobs get killed after Sunday. What is the best way to do something similar for
Slurm? A reservation? We would also like any running jobs killed after
Sunday, if possible. Thanks.
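One possible shape for this (a hedged sketch, not the only way; the partition
name, node list, and times are all made up): a dedicated partition with no
wall-time limit, plus a cron job that clears out whatever is left in it on
Sunday night. A weekly maintenance reservation can keep new jobs from starting
into Monday, but it won't kill jobs that are already running, hence the scancel:

    # slurm.conf: weekend partition with no time limit
    PartitionName=weekend Nodes=node[001-064] MaxTime=UNLIMITED State=UP

    # root crontab: cancel everything in the weekend partition Sunday at 23:59
    59 23 * * 0  scancel --partition=weekend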
se the epilog script, you can set the epilog script to
clean up all residues from the finished jobs:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts
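A minimal sketch of that idea (the script path and the cleanup target are
hypothetical; the linked page has a fuller example):

    # slurm.conf
    Epilog=/etc/slurm/epilog.sh

    #!/bin/bash
    # /etc/slurm/epilog.sh -- runs on each allocated node when a job finishes
    # Remove the job's scratch directory, if one was created
    rm -rf "/scratch/${SLURM_JOB_ID}"
    exit 0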
Ahmet M.
On 28.05.2019 at 19:03, Matthew BETTINGER wrote:
> We
We use triggers for the obvious alerts, but is there a way to make a trigger
for nodes stuck in CG (completing) state? Some user jobs, mostly Julia
notebooks, can get hung in completing state if the user kills the running job
or cancels it with Ctrl-C. When this happens we can have many, many nodes
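As far as I know strigger has no event for the completing state itself, so one
workaround (a hedged sketch; the recipient address is a placeholder) is a
periodic check for jobs sitting in CG:

    # Show anything currently in completing state: job id, node list, elapsed time
    squeue --states=COMPLETING --format="%i %N %M"

    # Cron-style check that mails an alert when something is stuck
    if [ -n "$(squeue --noheader --states=COMPLETING)" ]; then
        squeue --states=COMPLETING | mail -s "Jobs stuck in CG state" admin@example.com
    fi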
QOS works fine.
I just do not know enough about this and how to test again without causing
disruption. We're an inch high and a mile wide over here, with 3-4 different schedulers.
On 3/5/19, 11:29 AM, "slurm-users on behalf of Christopher Samuel"
wrote:
On 3/5/19 7:37 AM, Matthew BETTINGER wrote:
ould be different if that were the case.
Look at that, maybe send the QOS and partition config.
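For reference, the usual way to dump both is just the standard query commands
(nothing site-specific assumed):

    scontrol show partition
    sacctmgr show qos format=name,priority,maxwall,flags
    sacctmgr show assoc format=account,user,partition,qos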
- Michael
On Tue, Mar 5, 2019 at 7:40 AM Matthew BETTINGER
wrote:
Hey Slurm gurus. We have been trying to enable Slurm
Hey Slurm gurus. We have been trying to enable Slurm QOS on a Cray system here,
off and on, for quite a while, but can never get it working. Every time we try
to enable QOS we disrupt the cluster and users and have to fall back. I'm not
sure what we are doing wrong. We run a pretty open system
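For context, the minimal pieces usually involved in turning QOS enforcement on
look roughly like this (a hedged sketch of standard options, not a diagnosis of
the Cray-specific failure; user and QOS names are hypothetical). A common
gotcha is switching on enforcement before every user has a valid association
and a default QOS, at which point new submissions start getting rejected:

    # slurm.conf: enforce associations and QOS limits
    AccountingStorageEnforce=associations,limits,qos

    # Make sure each user can use, and defaults to, some QOS first
    sacctmgr modify user where name=foo set qos+=normal defaultqos=normal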
We stuck Avere between Isilon and a cluster to get us over the hump until the
next budget cycle ... then we replaced it with Spectrum Scale for mid-level
storage. We still use Lustre, of course, as scratch.
On 2/22/19, 12:24 PM, "slurm-users on behalf of Will Dennis"
wrote:
(replies inline)
One of the main guys, Panos, left Bright, so no answer to your specific
question, but I hope you can get some support with it. We dumped our BC PoC;
the sysadmin who worked on the PoC still has nightmares.
On 2/13/19, 6:54 AM, "slurm-users on behalf of John Hearns"
wrote:
Yugendra, the Brigh
Not sure if this is related, but we ran into an issue configuring accounting
because our cluster name had a '-' in it. That is an illegal character
for table names in MariaDB, or at least it used to be.
On 1/17/19, 11:07 AM, "slurm-users on behalf of Sajesh Singh"
wrote:
Trying to setup acco
Hello,
We are trying to find a way to gather information about jobs assigned to GPUs.
I'm not really finding anything we want using sreport for some reason. We
would like to find CPU hours for our GPUs. They are defined in GRES and need
to be passed to srun when users request GPUs. It lo
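If the GPU gres is tracked as a TRES, both sreport and sacct can report on it;
a hedged sketch (the date range is a placeholder, and AccountingStorageTRES has
to include gres/gpu before the data starts being collected):

    # slurm.conf: track the GPU gres in accounting
    AccountingStorageTRES=gres/gpu

    # GPU-hours per account and user for a month
    sreport -t hours -T gres/gpu cluster AccountUtilizationByUser start=2019-01-01 end=2019-02-01

    # Per-job view of allocated TRES and elapsed time
    sacct -X --format=JobID,User,Elapsed,AllocTRES%60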
MaxMemPerNode=UNLIMITED
I can run jobs in there, but if I restrict it to just one user (myself) then
the job does not run. I may have to leave this partition like this until I can
figure out the correct way, since we need this to run today.
On 1/3/19, 8:41 AM, "Matthew BETTINGER"
wrote:
Hello,
We are running Slurm 17.02.6 with accounting on a Cray CLE system.
We currently have a 24-hour runtime limit on our partitions, and a user needs
to run a job which will exceed 24 hours of runtime. I tried to make a
reservation, as seen below, allocating the user 36 hours to run his job, but it
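For reference, a user-scoped reservation along those lines usually looks
roughly like this (a hedged sketch; times, node count, and names are made up),
and the job then has to be submitted with --reservation. Note that a
reservation by itself does not lift the partition's MaxTime, so the time limit
usually still has to be handled separately (for example via a QOS, as discussed
elsewhere on this list):

    scontrol create reservation reservationname=long_foo users=foo \
        starttime=2019-01-03T18:00:00 duration=36:00:00 nodecnt=4 flags=ignore_jobs

    # The user submits into it
    sbatch --reservation=long_foo --time=36:00:00 job.sh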