Hi Marcus,
On 5/5/22 14:45, Marcus Boden wrote:
we had similar issues on our systems. As I understand from the bug you
linked, we just need to wait until all the old jobs are finished (and the
old slurmstepd are gone). So a full drain should not be necessary?
Yes, I believe that sounds right.
I've been thinking about how to determine the timestamp of the oldest job
running on the cluster, and then make sure this is after the time that all
slurmd daemons were upgraded to 21.08.8.
This command will tell you the oldest running jobs:
$ squeue -t running -O StartTime | sort | head
You can add more -O options to get JobIDs etc., as long as you sort on the
StartTime column (Slurm ISO 8601 timestamps[1] can simply be sorted in
lexicographical order).
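As a rough sketch of the comparison described above, the oldest StartTime can be checked against the time the last slurmd was upgraded. The function names and the cutoff timestamp here are made up for illustration and are not something Slurm provides; the only Slurm command assumed is squeue with -h (no header), -t and -O StartTime. The helpers just read timestamps on stdin, so they can be tried without a cluster:

```shell
# oldest_start: read ISO 8601 timestamps on stdin (one per line, possibly
# padded with spaces as squeue -O does) and print the earliest one.
oldest_start() {
    awk 'NF { print $1 }' | LC_ALL=C sort | head -n 1
}

# all_started_after CUTOFF: succeed (exit 0) if every timestamp on stdin is
# strictly later than CUTOFF. CUTOFF is a hypothetical, site-supplied value:
# the time at which the last slurmd was upgraded and restarted.
all_started_after() {
    cutoff=$1
    oldest=$(oldest_start)
    # No running jobs at all: nothing can predate the upgrade.
    if [ -z "$oldest" ]; then
        return 0
    fi
    # ISO 8601 timestamps sort lexicographically, so let sort decide
    # which of the two comes first.
    first=$(printf '%s\n%s\n' "$cutoff" "$oldest" | LC_ALL=C sort | head -n 1)
    [ "$first" = "$cutoff" ] && [ "$oldest" != "$cutoff" ]
}

# Example use on a cluster (the cutoff is an assumed example value):
#   squeue -h -t running -O StartTime | all_started_after 2022-05-05T12:00:00 \
#       && echo "no running job predates the upgrade"
```

The cutoff has to be given in the same local-time format that squeue prints, otherwise the string comparison is meaningless.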
I hope this helps.
/Ole
[1] https://en.wikipedia.org/wiki/ISO_8601
On 05.05.22 13:53, Ole Holm Nielsen wrote:
Just a heads-up regarding setting
CommunicationParameters=block_null_hash in slurm.conf:
On 5/4/22 21:50, Tim Wickberg wrote:
CVE-2022-29500:
An architectural flaw with how credentials are handled can be exploited
to allow an unprivileged user to impersonate the SlurmUser account.
Access to the SlurmUser account can be used to execute arbitrary
processes as root.
This issue impacts all Slurm releases since at least Slurm 1.0.0.
Systems remain vulnerable until all slurmdbd, slurmctld, and slurmd
processes have been restarted in the cluster.
Once all daemons have been upgraded sites are encouraged to add
"block_null_hash" to CommunicationParameters. That new option provides
additional protection against a potential exploit.
The block_null_hash option still needs to be documented in the slurm.conf
man-page. But in https://bugs.schedmd.com/show_bug.cgi?id=14002 I was
assured that it's OK to use it now.
I upgraded 21.08.7 to 21.08.8 using RPM packages while the cluster was
running production jobs. This is perhaps not recommended (see
https://slurm.schedmd.com/quickstart_admin.html#upgrade), but it worked
without a glitch in this case as well.
However, when I defined CommunicationParameters=block_null_hash in
slurm.conf later today, I started getting RPC errors on the compute
nodes and in slurmctld when jobs were completing, see bug 14002.
I would recommend that sites hold off on setting
CommunicationParameters=block_null_hash until we have found a resolution
in bug 14002. Draining all jobs from the cluster before setting this
parameter may be the safe approach(?).