Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-10 Thread Chris Samuel
On Thursday, 10 May 2018 1:02:36 AM AEST Eric F. Alemany wrote:

> All seem good for now

Great news!

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-09 Thread Eric F. Alemany
Good Morning (at least for those on the West coast of the US)

My nodes are no longer “down”

eric@radoncmaster:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 4 idle radonc[01-04]

I think the NTP configuration did the trick

So one possibility there is

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
+1 404 648 9024

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Eric F. Alemany <ealem...@stanford.edu>
Sent: Monday, May 7, 2018 7:40:53 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Nodes are down after 2-3 minutes.

Hi Chris,

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Riebs, Andy
Sent: Monday, May 7, 2018 7:40:53 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Nodes are down after 2-3 minutes.

Hi Chris,

I followed the link as well as the instruction on “Securing the installation” and “Testing the installation”

The only thing that i am not able to do is: Check

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Chris Samuel
On Tuesday, 8 May 2018 9:40:53 AM AEST Eric F. Alemany wrote:

> I followed the link as well as the instruction on “Securing the
> installation” and “Testing the installation”

Great.

> The only thing that i am not able to do is: Check if a credential can be
> remotely decoded

So one possibility

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Hi Chris,

I followed the link as well as the instructions on “Securing the installation” and “Testing the installation”.

The only thing that I am not able to do is: Check if a credential can be remotely decoded

eric@radoncmaster:/etc/munge$ munge -n | ssh e...@radonc01.stanford.edu
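The remote-decode test mentioned here pipes a credential created on one host into unmunge on another host (munge -n | ssh HOST unmunge). A minimal sketch for running that test against several nodes in one go follows. It is not a command from the thread: the function name check_remote_decode is hypothetical, and it assumes ssh access from the head node to each node without a password prompt.

```shell
# check_remote_decode HOST...
# For each host, create a credential locally and try to decode it remotely.
# Prints "HOST: OK" or "HOST: FAIL"; returns non-zero if any host failed.
# Assumption: ssh to each host works without interactive prompts.
check_remote_decode() {
    local host failed=0
    for host in "$@"; do
        # The pipeline's exit status is that of ssh, i.e. of the remote unmunge.
        if munge -n | ssh "$host" unmunge >/dev/null 2>&1; then
            printf '%s: OK\n' "$host"
        else
            printf '%s: FAIL\n' "$host"
            failed=1
        fi
    done
    return "$failed"
}

# Example call, using the node names shown by sinfo in this thread:
# check_remote_decode radonc01 radonc02 radonc03 radonc04
```

A FAIL only says that decoding did not succeed on that node. Running the pipeline by hand without the output redirection shows the actual unmunge or ssh error.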

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Chris Samuel
On Tuesday, 8 May 2018 8:38:47 AM AEST Eric F. Alemany wrote:

> I thought i did but I will do it again

If that doesn't work then check the "Securing the Installation" and "Testing the Installation" parts of the munge docs here (ignore the installation part):

https://github.com/dun/munge/wiki/In

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Hi Chris,

I thought I did but I will do it again

Best,
Eric

Eric F. Alemany
System Administrator for Research
Division of Radiation & Cancer Biology
Department of Radiation Oncology
Stanford

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Chris Samuel
On Tuesday, 8 May 2018 8:21:46 AM AEST Eric F. Alemany wrote:

> copied the /etc/munge/munge.key from the master to all the nodes.
> Checked the date on master and nodes - OK
>
> systemctl restart slurmctld on master
>
> systemctl restart slurmd on all nodes

Did you restart munged as well? Tha

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Sorry to report that I still have the same problem.

copied the /etc/munge/munge.key from the master to all the nodes.
Checked the date on master and nodes - OK

systemctl restart slurmctld on master
systemctl restart slurmd on all nodes

checked again /var/log/slurm-llnl/SlurmdLogFile.log
[201

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Thanks Paul.

Eric F. Alemany
System Administrator for Research
Division of Radiation & Cancer Biology
Department of Radiation Oncology
Stanford University School of Medicine
Stanford, Califor

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Paul Edmon
Any command can be used to copy it. We deploy ours using puppet.

-Paul Edmon-

On 05/07/2018 04:04 PM, Eric F. Alemany wrote:
Thanks Andy. I think I omitted a big step, which is copying the /etc/munge/munge.key from master/headnode to /etc/munge/munge.key on all the nodes - am I right? i
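As Paul says, any copy mechanism works. One possible way to do it by hand is sketched here, purely as an illustration: it assumes root ssh access to the nodes, a systemd unit named munge, a munge user and group on each node, and owner-only permissions on the key. None of these details are confirmed in the thread. The hypothetical helper plan_key_sync only prints the commands so they can be reviewed before anything is run.

```shell
# plan_key_sync NODE...
# Prints (does not execute) the commands to push the head node's munge key
# to each node, fix its ownership and mode, and restart the daemons.
# Assumptions: root ssh access, systemd units "munge" and "slurmd".
plan_key_sync() {
    local key=/etc/munge/munge.key
    local node
    for node in "$@"; do
        printf 'scp -p %s root@%s:%s\n' "$key" "$node" "$key"
        printf 'ssh root@%s "chown munge:munge %s && chmod 600 %s"\n' "$node" "$key" "$key"
        # munge is restarted before slurmd so that slurmd starts against the new key.
        printf 'ssh root@%s "systemctl restart munge && systemctl restart slurmd"\n' "$node"
    done
}

# Review the plan first, then run it if it looks right:
# plan_key_sync radonc01 radonc02 radonc03 radonc04
# plan_key_sync radonc01 radonc02 radonc03 radonc04 | sh
```

Printing the plan rather than executing it directly keeps the sketch safe to try. A configuration management tool such as the puppet setup Paul mentions achieves the same result in a repeatable way.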

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Thanks Andy. I think I omitted a big step, which is copying the /etc/munge/munge.key from master/headnode to /etc/munge/munge.key on all the nodes - am I right? I don't recall doing this, so that could be the problem. Is there a specific command I need to use to copy the munge.key from the mast

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Andy Riebs
The two most likely causes of munge complaints:

1. Different keys in /etc/munge/munge.key
2. Clocks out of sync on the nodes in question

Andy

On 05/07/2018 03:50 PM, Eric F. Alemany wrote:
Greetings, Reminder: i am new to SLURM. When i execute “sinfo” my nodes are down. sinfo PARTITION AV
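Both causes Andy names can be checked from the head node. What follows is a minimal sketch rather than anything posted in the thread. The helper names keys_match and clock_skew_ok are hypothetical, the 5-second default tolerance is an arbitrary illustrative value, and the commented usage assumes ssh access to the nodes plus sudo rights to read the key.

```shell
# keys_match REFERENCE_SUM SUM...
# Succeeds only if every checksum equals the reference checksum.
keys_match() {
    local ref="$1"
    shift
    local sum
    for sum in "$@"; do
        [ "$sum" = "$ref" ] || return 1
    done
    return 0
}

# clock_skew_ok LOCAL_EPOCH REMOTE_EPOCH [TOLERANCE_SECONDS]
# Succeeds if the two epoch timestamps differ by at most the tolerance
# (default 5 seconds, an illustrative value only).
clock_skew_ok() {
    local diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    [ "$diff" -le "${3:-5}" ]
}

# Possible usage from the head node (assumes ssh and sudo work on each node):
# ref=$(sudo sha256sum /etc/munge/munge.key | cut -d' ' -f1)
# for n in radonc01 radonc02 radonc03 radonc04; do
#     sum=$(ssh "$n" sudo sha256sum /etc/munge/munge.key | cut -d' ' -f1)
#     keys_match "$ref" "$sum" || echo "$n: munge.key differs from head node"
#     clock_skew_ok "$(date +%s)" "$(ssh "$n" date +%s)" || echo "$n: clock out of sync"
# done
```

Comparing checksums avoids printing the key itself, and comparing epoch seconds sidesteps time zone differences between hosts. The ssh round trip adds a little delay, so a tolerance of a few seconds is more practical than an exact match.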

[slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Eric F. Alemany
Greetings,

Reminder: I am new to SLURM. When I execute “sinfo” my nodes are down.

sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 4 down* radonc[01-04]

This is what I have done so far and nothing has helped. The nodes are in “idle” state for 2-3 minute