I have done that for several clients.
1. Staging data is a pain. The simplest approach was to stage it as part of
the job script, or to make the job dependent upon a separate
staging job. Where bandwidth is an issue, we have implemented bbcp.
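The staging-job dependency described above can be sketched with sbatch's `--parsable` and `--dependency` options (the script names `stage_data.sh` and `compute.sh` are hypothetical placeholders):

```shell
#!/bin/bash
# Submit the staging job first; --parsable makes sbatch print only the job ID.
stage_id=$(sbatch --parsable stage_data.sh)

# The compute job is held until the staging job exits successfully (afterok).
sbatch --dependency=afterok:"${stage_id}" compute.sh
```

With `afterok`, a failed staging job leaves the compute job pending rather than running against incomplete data.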
2. Depending on size and connectivity, you can use hosts files or
create a subdomain for the cluster nodes. I prefer the latter. Just
use static IPs for your cloud nodes. You do need to ensure
connectivity with networks, etc., of course.
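For the hosts-file approach with static IPs, the cloud nodes can simply be listed on every host (the names and addresses below are made up for illustration):

```
# /etc/hosts fragment distributed to all on-prem and cloud nodes
10.10.0.11  cloud-node1
10.10.0.12  cloud-node2
```

A dedicated subdomain (e.g. a zone your team controls) avoids pushing this file around, which is why it tends to scale better.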
3. SchedMD has the info on cloud nodes:
https://slurm.schedmd.com/elastic_computing.html
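The elastic computing page linked above centers on a handful of slurm.conf power-save settings; a minimal sketch (program paths, node names, and sizes are placeholders, not a tested configuration) looks like:

```
# slurm.conf excerpt for cloud (power-save) nodes
ResumeProgram=/usr/local/sbin/cloud_resume.sh    # site script that boots a cloud instance
SuspendProgram=/usr/local/sbin/cloud_suspend.sh  # site script that tears it down
SuspendTime=600                                  # seconds idle before a node is suspended
ResumeTimeout=300                                # max seconds to wait for a node to boot

NodeName=cloud-node[1-2] State=CLOUD CPUs=8 RealMemory=32000
PartitionName=cloud Nodes=cloud-node[1-2] MaxTime=INFINITE State=UP
```

`State=CLOUD` keeps the nodes hidden until a job triggers ResumeProgram; the resume/suspend scripts are where your provider's API calls go.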
4. Try to isolate everything you use so it isn't overly dependent on
some other group's services (e.g. DNS, authentication, etc.) unless
you can be aware of any changes they are making, so you aren't
surprised. Also, avoid network mounts on nodes. Performance takes a
big hit when that traffic goes over a direct connect or VPN.
Brian Andrus
On 12/15/2020 12:02 PM, Sajesh Singh wrote:
We are currently investigating the use of the cloud scheduling
features within an on-site Slurm installation and were wondering if
anyone had any experiences they wish to share from trying to use
this feature. In particular I am interested to know:
https://slurm.schedmd.com/elastic_computing.html
1) Recommendations for staging the data needed by the nodes in the
cloud
2) How did you handle name resolution
3) Any resources/documentation in particular that proved helpful while
setting up the environment
4) Any bits of advice or horror stories that may be helpful in
avoiding pitfalls.
Regards,
-SS-