Hi Juergen,
My upgrade report: We upgraded from 21.08.7 to 21.08.8-1 yesterday for the
entire cluster, and we didn't have any issues. I built RPMs from the
tar-ball and simply did "yum update" on the nodes (one partition at a
time) while the cluster was running in full production mode. All slurmd
daemons get restarted during the yum update, and this takes 1-2 minutes
per partition.
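For reference, the per-partition procedure boils down to roughly the
following (a sketch only; rpmbuild -ta on the release tarball is the
standard way to build the Slurm RPMs, but the pdsh host list and our
local yum repository are site-specific details):

  # on the build host: build RPMs from the release tarball
  rpmbuild -ta slurm-21.08.8-2.tar.bz2
  # after publishing the RPMs to the local repo, one partition at a time:
  pdsh -w 'node[001-100]' "yum -y update slurm\*"
  # slurmd gets restarted as part of the update; check for unresponsive nodes
  sinfo -R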
Today I upgraded from 21.08.8-1 to 21.08.8-2 for the entire cluster, and
again we have not seen any issues.
We are also *not* setting CommunicationParameters=block_null_hash until a
later date, when there are no more old versions of slurmstepd running. We
did, however, briefly see RPC errors with "Protocol authentication error"
while block_null_hash was enabled, see
https://bugs.schedmd.com/show_bug.cgi?id=14002, so we turned it off
again. The errors haven't occurred since.
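Enabling it later should be nothing more than the usual slurm.conf change
plus a reconfigure; roughly (a sketch, assuming the config file is already
synchronized to all nodes; a daemon restart may be needed instead of a
reconfigure):

  # slurm.conf, to be enabled once no pre-21.08.8 slurmstepd processes remain
  CommunicationParameters=block_null_hash
  # then:
  scontrol reconfigure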
Best regards,
Ole
On 5/6/22 01:57, Juergen Salk wrote:
Hi John,
this is really bad news. We have stopped our rolling update from Slurm
21.08.6 to Slurm 21.08.8-1 today for exactly that reason: the state of
compute nodes already running the 21.08.8-1 slurmd suddenly started
flapping between responding and not responding, while all other nodes
that were still running the 21.08.6 slurmd were not affected.
For the affected nodes we did not see any obvious reason in slurmd.log,
even with SlurmdDebug set to debug3, but we noticed the following
in slurmctld.log with SlurmctldDebug=debug and DebugFlags=route
enabled:
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1425 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1426 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811 RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding
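In case others want to gather the same diagnostics: the slurmctld side
can be raised at runtime, roughly as follows (a sketch; the
SlurmdDebug=debug3 change still requires a slurm.conf edit plus a slurmd
restart or reconfigure):

  # raise slurmctld log level and enable the Route debug flag at runtime
  scontrol setdebug debug
  scontrol setdebugflags +route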
So you have seen this as well with 21.08.8-2?
We didn't have CommunicationParameters=block_null_hash set, btw.
Actually, after Tim's last announcement, I was hoping that we could start
over tomorrow morning with 21.08.8-2 to resolve this issue. Therefore,
I would also be highly interested in what others can say about rolling updates
from Slurm 21.08.6 to Slurm 21.08.8-2, which, at least temporarily, entail a
mix of patched and unpatched slurmd versions on the compute nodes.
If the 21.08.8-2 slurmd still does not work together with the 21.08.6 slurmd,
we may have to drain the whole cluster to update Slurm, which
is something I had actually hoped to avoid.
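In that case we would probably set a cluster-wide maintenance reservation
rather than draining nodes one by one; roughly something like this (a
sketch only, the reservation name and start time are made up):

  scontrol create reservation reservationname=slurm_upgrade users=root \
      starttime=2022-05-09T08:00:00 duration=240 \
      flags=maint,ignore_jobs nodes=ALL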
Best regards
Jürgen
* Legato, John (NIH/NHLBI) [E] <lega...@nhlbi.nih.gov> [220505 22:30]:
Hello,
We are in the process of upgrading from Slurm 21.08.6 to Slurm 21.08.8-2. We've
upgraded the controller and a few partitions' worth of nodes. We notice the
nodes are losing contact with the controller even though slurmd is still up.
We thought that this
issue was fixed in -2 based on this bug report:
https://bugs.schedmd.com/show_bug.cgi?id=14011
However, we are still seeing the same behavior. I note that nodes running
21.08.6 are having no issues with communication. I could
upgrade the remaining 21.08.6 nodes but hesitate to do that, as it seems
it would completely kill the currently functioning nodes.
Is anyone else still seeing this in -2?
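In case it helps with comparing notes: we distinguish the two populations
by the slurmd version each node reports, roughly like this (a sketch; the
node name and node list are placeholders, and pdsh is simply what we
happen to use):

  # controller's view of the registered slurmd version
  scontrol show node <nodename> | grep -o 'Version=[^ ]*'
  # or ask the nodes directly
  pdsh -w '<nodelist>' 'slurmd -V'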