Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-07 Thread Bjørn-Helge Mevik
Jonathon A Anderson writes: > ## Queue stuffing There is the bf_max_job_user SchedulerParameter, which is sort of the "poor man's MAXIJOB"; it limits the number of jobs from each user the backfiller will try to start on each run. It doesn't do exactly what you want, but at least the backfiller
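(For illustration only: a minimal slurm.conf sketch of this parameter, with a hypothetical limit of 10 jobs per user per backfill pass; changes to SchedulerParameters are normally picked up with "scontrol reconfigure".)

    SchedulerParameters=bf_max_job_user=10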

Re: [slurm-users] slurmdbd: mysql/accounting errors on 17.11.6 upgrade

2018-05-07 Thread Ole Holm Nielsen
On 05/07/2018 10:19 PM, Tina Fora wrote: Hello, I upgraded from 17.02.10 to 17.11.6 on EL6.9 and getting the errors below. Database is on EL7 mariadb-5.5. Migrating to a new version of MySQL/MariaDB requires further steps on the database (unrelated to Slurm). You must run: mysql_upgrade a
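(A hedged sketch of the usual sequence on the database host; exact credentials and service names depend on the installation, and slurmdbd should be restarted afterwards on its own host.)

    mysql_upgrade -u root -p
    systemctl restart mariadb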

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-07 Thread Ryan Novosielski
One of these TRES-related ones in a QOS ought to do it: https://slurm.schedmd.com/resource_limits.html Your problem there, though, is that you will eventually have stuff waiting to run even when the system is idle. We had the same circumstance and the same eventual outcome. -- || \\UTGERS,
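(Illustrative only: one way this might look with sacctmgr, assuming a QOS named "normal" is attached to the partition; the TRES name and limit are hypothetical.)

    sacctmgr modify qos normal set MaxTRESPerUser=cpu=256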

[slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-07 Thread Jonathon A Anderson
We have two main issues with our scheduling policy right now. The first is an issue that we call "queue stuffing." The second is an issue with interactive job availability. We aren't confused about why these issues exist, but we aren't sure the best way to address them. I'd love to hear any sug

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Hi Andy, I followed your advice - md5sum /etc/munge/munge.key and they are the same on all systems. What else can it be? I will check the clock on the systems again as suggested by Chris. Best, Eric ___

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Riebs, Andy
The /etc/munge/munge.key is different on the systems. Try md5sum /etc/munge/munge.key on both systems to see if they are the same... -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise +1 404 648 9024 From: slurm-users on behalf of Eric F. Alemany Sent
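(A quick way to compare across all machines, as a sketch; the radonc01-04 host names are taken from this thread.)

    # on the head node:
    md5sum /etc/munge/munge.key
    for h in radonc0{1..4}; do ssh $h md5sum /etc/munge/munge.key; done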

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Chris Samuel
On Tuesday, 8 May 2018 9:40:53 AM AEST Eric F. Alemany wrote: > I followed the link as well as the instruction on “Securing the > installation” and “Testing the installation” Great. > The only thing that i am not able to do is: Check if a credential can be > remotely decoded So one possibility

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Hi Chris, I followed the link as well as the instructions on “Securing the installation” and “Testing the installation”. The only thing that I am not able to do is: Check if a credential can be remotely decoded eric@radoncmaster:/etc/munge$ munge -n | ssh e...@radonc01.stanford.edu
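(For reference, the remote-decode test from the munge wiki looks roughly like this; the node name is only an example.)

    munge -n | ssh radonc01 unmunge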

Re: [slurm-users] Limit number of specific concurrent jobs per node

2018-05-07 Thread Gareth.Williams
Hi Andreas, You could define a generic consumable resource per node and have the scheduling take account of requests for it. In principle, you could do this for say interface_bandwidth or io_bw and try and use real numbers, but in practice users don't know how much they need and will use and ad
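(A hedged sketch of the count-only GRES approach; the resource name "iolimit" and the counts are entirely hypothetical, and depending on the Slurm version a matching gres.conf entry on each node may also be required.)

    # slurm.conf
    GresTypes=iolimit
    NodeName=node[01-56] CPUs=28 Gres=iolimit:2 State=UNKNOWN

    # gres.conf on each node
    Name=iolimit Count=2

    # jobs that should be limited to 2 per node then request the resource:
    sbatch --gres=iolimit:1 -J myjob jobscript.sh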

Re: [slurm-users] slurmdbd: mysql/accounting errors on 17.11.6 upgrade

2018-05-07 Thread Chris Samuel
On Tuesday, 8 May 2018 6:19:16 AM AEST Tina Fora wrote: > slurmdbd: error: mysql_query failed: 1062 Duplicate entry > '3508-1399520701' for key 'id_job' That doesn't look good, not sure what to advise there. Do you have a backup of the database from before you started? If you've got a support
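(For reference, such a backup is typically taken before the upgrade with something like the following, assuming the default accounting database name slurm_acct_db.)

    mysqldump -u root -p slurm_acct_db > slurm_acct_db-pre-17.11.sql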

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Chris Samuel
On Tuesday, 8 May 2018 8:38:47 AM AEST Eric F. Alemany wrote: > I thought i did but I will do it again If that doesn't work then check the "Securing the Installation" and "Testing the Installation" parts of the munge docs here (ignore the installation part): https://github.com/dun/munge/wiki/In

[slurm-users] srun seg faults immediately from within sbatch but not salloc

2018-05-07 Thread a . vitalis
Dear all, I am trying to set up a small cluster running slurm on Ubuntu 16.04. I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. Installation seems fine. Munge is taken from the system package. Something like this: ./configure --prefix=/software/slurm/slurm-17.11.5 --ex

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Hi Chris, I thought I did, but I will do it again. Best, Eric _ Eric F. Alemany System Administrator for Research Division of Radiation & Cancer Biology Department of Radiation Oncology Stanford

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Chris Samuel
On Tuesday, 8 May 2018 8:21:46 AM AEST Eric F. Alemany wrote: > copied the /etc/munge/munge.key from the master to all the nodes. > Checked the date on master and nodes - OK > > systemctl restart slurmctld on master > > systemctl restart slurmd on all nodes Did you restart munged as well? Tha
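(A minimal sketch of the restart order, assuming the systemd unit names used by the distribution packages.)

    systemctl restart munge      # on the head node and every compute node
    systemctl restart slurmctld  # on the head node
    systemctl restart slurmd     # on every compute node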

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-07 Thread Chris Samuel
On Tuesday, 8 May 2018 2:27:07 AM AEST Mahmood Naderan wrote: > So the trick was to UNDRAIN the node and not RESUME it. That's strange, because UNDRAIN only does a subset of what RESUME does. "UNDRAIN" clears the node from being drained (like "RESUME"), but will not change the node'
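(For reference, both operations are issued with scontrol; the node name is the one from this thread.)

    scontrol update nodename=rocks7 state=undrain   # clears only the drain flag
    scontrol update nodename=rocks7 state=resume    # clears drain and returns a down node to service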

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Sorry to report that i still have the same problem. copied the /etc/munge/munge.key from the master to all the nodes. Checked the date on master and nodes - OK systemctl restart slurmctld on master systemctl restart slurmd on all nodes checked again /var/log/slurm-llnl/SlurmdLogFile.log [201

Re: [slurm-users] ReqNodeNotAvail, but none of nodes in partition are listed.

2018-05-07 Thread Prentice Bisbal
Fewer. ;) True. What was I thinking? Sometimes even the person who set the reservation doesn’t figure it out. Like me/us? ;) Prentice On 05/07/2018 05:42 PM, Ryan Novosielski wrote: Fewer. ;) I think rumor had it that there were plans for some improvement in this area (you might check th

Re: [slurm-users] ReqNodeNotAvail, but none of nodes in partition are listed.

2018-05-07 Thread Ryan Novosielski
Fewer. ;) I think rumor had it that there were plans for some improvement in this area (you might check the bugs or this mailing list — I can’t remember where I saw it, but it was a while back now), because ReqNodeNotAvail almost never means something useful, and reservations don’t actually gene

Re: [slurm-users] ReqNodeNotAvail, but none of nodes in partition are listed.

2018-05-07 Thread Prentice Bisbal
Dang it. That's it. I recently changed the default time limit on some of my partitions, to only 48 hours. I have a reservation that starts on Friday at 5 PM. These jobs are all assigned to partitions that still have longer time limits. I forgot that not all partitions have the new 48-hour limit

Re: [slurm-users] ReqNodeNotAvail, but none of nodes in partition are listed.

2018-05-07 Thread Ryan Novosielski
In my experience, it may say that even if it has nothing to do with the reason the job isn’t running, if there are nodes on the system that aren’t available. I assume you’ve checked for reservations? > On May 7, 2018, at 5:06 PM, Prentice Bisbal wrote: > > Dear Slurm Users, > > On my cluster,
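(For reference, a quick way to check is:)

    scontrol show reservation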

[slurm-users] ReqNodeNotAvail, but none of nodes in partition are listed.

2018-05-07 Thread Prentice Bisbal
Dear Slurm Users, On my cluster, I have several partitions, each with their own QOS, time limits, etc. Several times today, I've received complaints from users that they submitted jobs to a partition with available nodes, but jobs are stuck in the PD state. I have spent the majority of my da

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Thanks Paul. _ Eric F. Alemany System Administrator for Research Division of Radiation & Cancer Biology Department of Radiation Oncology Stanford University School of Medicine Stanford, Califor

[slurm-users] slurmdbd: mysql/accounting errors on 17.11.6 upgrade

2018-05-07 Thread Tina Fora
Hello, I upgraded from 17.02.10 to 17.11.6 on EL6.9 and getting the errors below. Database is on EL7 mariadb-5.5. After yum update slurm: # slurmdbd -D -vvv slurmdbd: debug: Log file re-opened slurmdbd: debug: Munge authentication plugin loaded slurmdbd: debug2: mysql_connect() called for db s

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Paul Edmon
Any command can be used to copy it.  We deploy ours using puppet. -Paul Edmon- On 05/07/2018 04:04 PM, Eric F. Alemany wrote: Thanks Andy. I think i omit a big step which is copying the /etc/munge/munge.key from master/headnode to all the /etc/munge/munge/key in the nodes - am i right?   i
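(A hedged sketch of doing it by hand with scp; the host names, permissions, and munge unit name are assumptions based on the rest of this thread.)

    for h in radonc0{1..4}; do
        scp -p /etc/munge/munge.key root@$h:/etc/munge/munge.key
        ssh root@$h 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key && systemctl restart munge'
    done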

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Thanks Andy. I think I omitted a big step, which is copying the /etc/munge/munge.key from the master/headnode to /etc/munge/munge.key on all the nodes - am I right? I don't recall doing this, so that could be the problem. Is there a specific command I need to use to copy the munge.key from the mast

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Andy Riebs
The two most likely causes of munge complaints: 1. Different keys in /etc/munge/munge.key 2. Clocks out of sync on the nodes in question Andy On 05/07/2018 03:50 PM, Eric F. Alemany wrote: Greetings, Reminder: i am new to SLURM. When i execute  “sinfo” my nodes are down. sinfo PARTITION AV
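(A rough way to check the second point; chronyc/ntpq only apply if chrony or ntpd is actually in use, and the host name is only an example.)

    date; ssh radonc01 date   # compare the head node and a compute node
    chronyc tracking          # if chrony is used
    ntpq -p                   # if ntpd is used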

[slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Greetings, Reminder: I am new to SLURM. When I execute “sinfo” my nodes are down. sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST debug* up infinite 4 down* radonc[01-04] This is what I have done so far and nothing has helped. The nodes are in “idle” state for 2-3 minute

Re: [slurm-users] sacct: error

2018-05-07 Thread Eric F. Alemany
Thank you Chris, Marcus, Patrick and Ray. I guess I am still a bit confused. We will see what happens when we run a job asking for the CPUs of the cluster. _ Eric F. Alemany System Administrator

Re: [slurm-users] Limit number of specific concurrent jobs per node

2018-05-07 Thread Mahmood Naderan
Hi, you may want to look at --nodelist in the sbatch manual: https://slurm.schedmd.com/sbatch.html On Mon, May 7, 2018, 21:29 Andreas Hilboll wrote: > Dear SLURM experts, we have a cluster of 56 nodes with 28 cores each. Is it possible to limit the number of jobs of a certain name which co
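(Illustrative only: pinning one of the jobs to a specific node would look roughly like this; the node and file names are hypothetical.)

    sbatch --nodelist=node01 -J RT01 runtimes/RT01/jobscript.sh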

[slurm-users] Limit number of specific concurrent jobs per node

2018-05-07 Thread Andreas Hilboll
Dear SLURM experts, we have a cluster of 56 nodes with 28 cores each. Is it possible to limit the number of jobs of a certain name which concurrently run on one node, without blocking the node for other jobs? For example, when I do for filename in runtimes/*/jobscript.sh; do sbatch -J

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-07 Thread Mahmood Naderan
Oh yes, that was brilliant. [root@rocks7 mahmood]# scontrol show node rocks7 NodeName=rocks7 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.02 AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null) NodeAddr=10.1.1.1 NodeHostName=rocks7 Version=17.11 OS=Linu

Re: [slurm-users] Memory oversubscription and sheduling

2018-05-07 Thread Cory Holcomb
Thank you for the reply; I was beginning to wonder if my message had been seen. While I understand how batch systems work, if you have a system daemon that develops a memory leak, it consumes memory outside of any allocation. Not checking the used memory on the box before dispatch seems like a good w
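(One possible mitigation, offered as an assumption rather than an answer from the thread: slurm.conf can reserve node memory for system use via MemSpecLimit; the values below are hypothetical and enforcement generally needs cgroup-based memory constraining.)

    NodeName=node01 CPUs=28 RealMemory=128000 MemSpecLimit=4096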

[slurm-users] srun --reboot in sbatch

2018-05-07 Thread Tueur Volvo
Hello, I am trying to reboot my node from within sbatch. When I run "srun hostname" it works. When I run "srun --reboot hostname" it works: my slurmd node reboots and executes hostname. But I create an sbatch file like this: #!/bin/bash -l #SBATCH --output=/nfs/myoutput.txt # Job steps: echo "begin" srun hostname
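(For context, the pattern being described is roughly the sketch below; note that --reboot is honoured only for privileged users such as root or SlurmUser, and, as far as I know, requires RebootProgram to be set in slurm.conf.)

    #!/bin/bash -l
    #SBATCH --output=/nfs/myoutput.txt
    echo "begin"
    srun --reboot hostname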

Re: [slurm-users] sacct: error

2018-05-07 Thread Chris Samuel
On Monday, 7 May 2018 5:41:27 PM AEST Marcus Wagner wrote: > To me it looks like CPUs is the synonym for hardware threads. Interesting, at ${JOB-1} we experimented with HT on a system back in 2013 and I didn't do the slurm.conf side at that time, but then you could only request physical cores a

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-07 Thread Werner Saar
Hi Mahmood, Please try the following commands on rocks7: systemctl restart slurmd systemctl restart slurmctld scontrol update node=rocks7 state=undrain Best regards Werner On 05/06/2018 02:09 PM, Mahmood Naderan wrote: Still I think for some reasons, slurms put the frontend in drain stat

Re: [slurm-users] sacct: error

2018-05-07 Thread Marcus Wagner
Hi Chris, this is not correct. From the slurm.conf manpage: CPUs: Number of logical processors on the node (e.g. "2"). CPUs and Boards are mutually exclusive. It can be set to the total number of sockets, cores or threads. This can be useful when you want to schedule only the cores on a hy
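(An illustrative slurm.conf node line for each case; the node name and counts are hypothetical.)

    # count every hardware thread as a schedulable CPU:
    NodeName=node01 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 CPUs=48
    # schedule only the physical cores of the same hyper-threaded node:
    NodeName=node01 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 CPUs=24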