I have done that for several clients.

1. Staging data is a pain. The simplest approach was to make it part of
   the job script, or to have the job depend on a separate staging job.
   Where bandwidth is an issue, we have implemented bbcp.
2. Depending on size and connectivity, you can use hosts files or
   create a subdomain for the cluster nodes. I prefer the latter. Just
   use static IPs for your cloud nodes. You do, of course, need to
   ensure connectivity between the networks.
3. SchedMD has the info on cloud nodes:
   https://slurm.schedmd.com/elastic_computing.html
4. Try to isolate everything you use so it isn't overly dependent on
   another group's services (e.g., DNS, authentication) unless you are
   kept aware of any changes they make so you aren't surprised. Also,
   avoid network mounts on the cloud nodes; performance takes a big hit
   when that traffic goes over a direct connect or VPN.
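The dependent staging job mentioned in point 1 might look something like
this (the script names are made up for illustration; --parsable and
--dependency=afterok are standard sbatch options):

```shell
# Submit the staging job first and capture its job ID.
# stage_data.sh is a placeholder script that copies/bbcp's the input data.
stage_id=$(sbatch --parsable stage_data.sh)

# The compute job starts only after the staging job completes successfully.
sbatch --dependency=afterok:"${stage_id}" run_analysis.sh
```

If staging fails, the compute job stays pending and can be cancelled
rather than burning cloud time on nodes with no data.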
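For point 3, the SchedMD page boils down to Slurm's power-save
mechanism. A minimal sketch of the relevant slurm.conf pieces follows;
the script paths, node names, and hardware specs are placeholders, not
values from this thread:

```
# slurm.conf -- minimal cloud/elastic sketch (paths and node specs are placeholders)
ResumeProgram=/opt/slurm/bin/resume.sh     # script that provisions a cloud node
SuspendProgram=/opt/slurm/bin/suspend.sh   # script that tears it down
ResumeTimeout=600       # seconds to wait for a node to boot and register
SuspendTime=300         # idle seconds before a node is suspended

# Nodes marked State=CLOUD are hidden until powered up on demand
NodeName=cloud[001-010] CPUs=8 RealMemory=30000 State=CLOUD
PartitionName=cloud Nodes=cloud[001-010] MaxTime=INFINITE State=UP
```

The resume/suspend scripts are where your cloud provider's API calls
go; Slurm only invokes them and waits for slurmd to check in.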

Brian Andrus


On 12/15/2020 12:02 PM, Sajesh Singh wrote:

We are currently investigating the use of the cloud scheduling features within an on-site Slurm installation and were wondering if anyone had any experiences with this feature that they would be willing to share. In particular I am interested to know:

https://slurm.schedmd.com/elastic_computing.html

1) Recommendations for staging the data needed by the nodes in the cloud

2) How did you handle name resolution

3) Any resources/documentation in particular that proved helpful while setting up the environment

4) Any bits of advice or horror stories that may be helpful in avoiding pitfalls.

Regards,

-SS-
