Hi all! In short, I'm considering housing some libraries that slurmd uses in an NFS share, and am curious what risk that sharing poses to slurmds with already-running jobs (I'm not concerned about the jobs themselves here).
For our next Slurm deployment (a fresh install, not a rolling upgrade), our Rocky 8 nodes will be 'statelite' (xCAT), as they already are in our CentOS 7 cluster. They NFS-mount a shared root image, which includes Slurm in /opt. A separate NFS server provides user home dirs and the "/usr/local"-like dir we call "/apps". /scratch lives in Lustre.

We'll be using CUDA, plus non-distro versions of PMIx and hwloc. To keep the RAM-dwelling OS image on the nodes small, I'd like these to live in the /apps NFS share, while keeping Slurm in the OS image's /opt. That would make CUDA, PMIx, and hwloc unavailable to the node slurmds if the /apps mount fails while the OS "/" mount does not. I won't care if slurmd can't start a job at such times, since the user apps would be unavailable anyway (and our NHC checks for that). But is there some risk to the slurmd parents of already-running jobs, if those slurmds need to (re-)access those libraries while they're unavailable? (I've put a small self-check sketch in a P.S. below.)

For comparison, I've looked at Nvidia's DeepOps (puts CUDA in an unshared /usr/local, replicated on each node), Dell's Omnia (puts CUDA in an NFS share), Nathan Rini's Docker-scale-out cluster (puts CUDA etc. in an unshared /usr/local, replicated on each node), and OpenHPC (Slurm in /usr, hwloc in NFS-shared /opt). I've also started deploying a dev Omnivector cluster (thanks, Mike Hanby!) using LXD, to see what they do, but haven't finished that.

I've seen a few "I'm starting a Slurm cluster" walkthrough threads online lately, but haven't seen this particular point addressed. I'm aware it might be a non-issue. Thanks.

--
Paul Brunk, system administrator
Advanced Computing Resource Center
Enterprise IT Svcs, the University of Georgia
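P.S. In case it's useful to anyone pondering the same thing: here's a rough sketch (mine, not anything from Slurm itself) of how one might check whether a node's slurmd or slurmstepd processes currently have files mmap()ed from under /apps, by reading /proc/<pid>/maps. The "/apps" prefix and the daemon names are just our site's choices; adjust to taste, and run it as root on a compute node.

    #!/usr/bin/env python3
    # List the files under PREFIX that slurmd/slurmstepd processes have
    # mapped, according to /proc/<pid>/maps.  PREFIX is our site-specific
    # NFS share (an assumption of this sketch, not a Slurm path).
    import os

    PREFIX = "/apps"                  # site-specific NFS mount (assumption)
    DAEMONS = {"slurmd", "slurmstepd"}

    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/comm" % pid) as f:
                comm = f.read().strip()
            if comm not in DAEMONS:
                continue
            hits = set()
            with open("/proc/%s/maps" % pid) as f:
                for line in f:
                    fields = line.split(None, 5)
                    # Field 6, when present, is the mapping's backing file.
                    if len(fields) == 6 and fields[5].startswith(PREFIX):
                        hits.add(fields[5].strip())
        except OSError:
            continue                  # process exited mid-scan
        for path in sorted(hits):
            print("%s[%s] maps %s" % (comm, pid, path))

My (unverified) reading of the risk: pages a daemon has already faulted in stay usable even if /apps disappears, but the next major fault against an unreachable NFS-backed mapping would block the process (hard mount) or SIGBUS it (soft mount), depending on mount options. So if this prints nothing, I'd expect an /apps outage can't reach the daemons through their libraries at all; if it does print lines for slurmstepd (e.g. libpmix for a step launched via the pmix plugin), that's where I'd start worrying.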