and it looks like i'll have to wait until 20.11 for a fix:
https://bugs.schedmd.com/show_bug.cgi?id=9035

On Wed, Aug 26, 2020 at 11:20 AM Michael Di Domenico <mdidomeni...@gmail.com> wrote:
>
> looks like a similar issue is being tracked by:
> https://bugs.schedmd.com/show_bug.cgi?id=9441
>
> On Wed, Aug 26, 2020 at 11:04 AM Michael Di Domenico
> <mdidomeni...@gmail.com> wrote:
> >
> > sorry i meant to say, our slurm nodehealth script pushed the node to
> > failed state. slurm itself wasn't doing this
> >
> > On Wed, Aug 26, 2020 at 11:02 AM Michael Di Domenico
> > <mdidomeni...@gmail.com> wrote:
> > >
> > > i just upgraded from v18 to v20. Did something change in the node
> > > config validation? it used to be that if i started slurm on a compute
> > > node that had lower than expected memory or was missing gpu's, slurm
> > > would push a node into a failed state that i could see in sinfo -R.
> > > now it seems to be logging every second in the slurmctld
> > > "slurm_rpc_node_registration invalid argument" log file for each node
> > > that's broken
> > >
> > > Is there some function that got disabled/changed? i use slurm to
> > > ferret out bad hardware, but logging to the logfile every seconds
> > > seems silly and since i don't routinely watch the log files things
> > > will go unnoticed
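
for anyone wanting the old behaviour back in the meantime, a node health script can do the memory check itself and mark the node so it shows up in sinfo -R. below is a rough sketch only, not the script from this thread. the 64000 MB threshold, the function names, and the choice of State=DRAIN (rather than FAIL) are all assumptions for illustration. it only prints the scontrol command; running it is left to the caller.

```python
import shlex
import socket


def mem_total_mb(meminfo_text):
    """Return MemTotal in MiB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like: "MemTotal:       65806452 kB"
            return int(line.split()[1]) // 1024
    raise ValueError("MemTotal not found")


def drain_command(node, reason):
    """Build the scontrol argv that drains a node and records a reason."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def check_memory(meminfo_text, expected_mb, node):
    """Return a drain command if the node has less memory than expected."""
    actual = mem_total_mb(meminfo_text)
    if actual < expected_mb:
        return drain_command(node, f"low memory: {actual}MB < {expected_mb}MB")
    return None


if __name__ == "__main__":
    EXPECTED_MB = 64000  # hypothetical threshold, set per node type
    node = socket.gethostname().split(".")[0]
    with open("/proc/meminfo") as f:
        cmd = check_memory(f.read(), EXPECTED_MB, node)
    if cmd:
        # Dry run: print the command instead of executing it.
        print(" ".join(shlex.quote(c) for c in cmd))
```

the reason string set this way is what sinfo -R reports, so a short node can be spotted there instead of in the slurmctld log. a gpu count check would follow the same pattern.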