Thanks Andy. I think I omitted a big step, which is copying /etc/munge/munge.key from the master/headnode to /etc/munge/munge.key on all the nodes. Am I right? I don't recall doing this, so that could be the problem.
Is there a specific command I need to run to copy the munge.key from the master/headnode to all the nodes?

Thank you for your help and sorry for such “beginner” questions.

Best,
Eric
_____________________________________________________________________________________________________
Eric F. Alemany
System Administrator for Research

Division of Radiation & Cancer Biology
Department of Radiation Oncology

Stanford University School of Medicine
Stanford, California 94305

Tel: 1-650-498-7969 No Texting
Fax: 1-650-723-7382

On May 7, 2018, at 12:57 PM, Andy Riebs <andy.ri...@hpe.com> wrote:

The two most likely causes of munge complaints:

1. Different keys in /etc/munge/munge.key
2. Clocks out of sync on the nodes in question

Andy

On 05/07/2018 03:50 PM, Eric F. Alemany wrote:

Greetings,

Reminder: I am new to SLURM.

When I execute “sinfo” my nodes are down.

sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 4 down* radonc[01-04]

This is what I have done so far and nothing has helped. The nodes are in “idle” state for 2-3 minutes and then they are “down” again.
systemctl restart slurmd on all nodes
systemctl restart slurmctld on master
scontrol update node=radonc[01-04] state=UNDRAIN
scontrol update node=radonc[01-04] state=IDLE

I looked at the log file in /var/log/SlurmdLogFile.log and saw some “munge decode failed: Invalid credential”:

[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol authentication error
[2018-05-07T12:37:20.028] error: Munge decode failed: Invalid credential
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol authentication error
[2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.14:42140]: Unspecified error
[2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.5:34752]: Unspecified error
[2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.6:46746]: Unspecified error
[2018-05-07T12:37:20.039] error: slurm_receive_msg [10.112.0.16:50788]: Unspecified error

I ran the following command on all nodes (including master/headnode) and got “Success”:

munge -n | unmunge | grep STATUS
STATUS: Success (0)

How can I fix this problem?

Thank you in advance for all your help.

Eric

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
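
On the key-copy question at the top of this thread: there is no dedicated munge command for it as far as I know, so a plain scp from the head node is one way to do it. The sketch below is only an illustration under several assumptions that are not stated in the thread: it is run as root on the head node, root can SSH to the nodes, the node hostnames are radonc01 through radonc04 (expanded from the sinfo NODELIST), the daemon runs as user "munge", and the systemd unit is named "munge". The helper names run and sync_munge_key are hypothetical. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: push the head node's munge key to every compute node.
# ASSUMPTIONS (not confirmed in the thread): run as root on the head node,
# root SSH access to the nodes, hostnames radonc01..radonc04,
# munge user/group "munge", systemd unit "munge".

NODES="radonc01 radonc02 radonc03 radonc04"
KEY=/etc/munge/munge.key
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Copy the key to each node, restore strict ownership and permissions,
# then restart munge so the daemon picks up the new key.
sync_munge_key() {
    for node in $NODES; do
        run scp -p "$KEY" "root@${node}:${KEY}"
        run ssh "root@${node}" "chown munge:munge $KEY && chmod 400 $KEY && systemctl restart munge"
    done
}

sync_munge_key
```

Note that "munge -n | unmunge" run locally only shows that a node can decode its own credentials, so it reports Success even when the keys differ between hosts. A cross-host check such as "munge -n | ssh radonc01 unmunge", run from the head node, exercises the shared key. After the keys match, slurmd on the nodes probably needs the same restart that was already tried in the original message.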