I think this is exactly the type of use case heterogeneous job support,
available since Slurm 17.11, is for. From the documentation:
Slurm version 17.11 and later supports the ability to submit and
manage heterogeneous jobs, in which each component has virtually all
job options available including partition, account and QOS (Quality Of
Service). For example, part of a job might require four cores and 4 GB
for each of 128 tasks while another part of the job would require 16
GB of memory and one CPU.
https://slurm.schedmd.com/heterogeneous_jobs.html
Using this, you should be able to use a single core for the transfer
from NFS, use all the cores/GPUs you need for the computation, and then
use a single core to transfer the results back to NFS.
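A rough, untested sketch of what such a batch script might look like; the
paths, resource sizes, and the process_data command are all made up, and
I'm glossing over whether the staging and compute components land on the
same node. (Recent Slurm uses "hetjob" / "--het-group"; older releases
used "packjob" / "--pack-group", if I remember right.)

    #!/bin/bash
    #SBATCH --job-name=stage-compute-unstage
    # Component 0: a single core for staging the input from NFS
    #SBATCH --ntasks=1 --cpus-per-task=1
    #SBATCH hetjob
    # Component 1: the GPU compute step (sizes here are invented)
    #SBATCH --ntasks=1 --cpus-per-task=16 --gres=gpu:2
    #SBATCH hetjob
    # Component 2: a single core for copying results back and cleaning up
    #SBATCH --ntasks=1 --cpus-per-task=1

    SCRATCH=/local/scratch/$SLURM_JOB_ID    # placeholder path

    # Stage in, compute, stage out; each step runs in its own component.
    # (Assumes process_data writes its output to $SCRATCH/results.)
    srun --het-group=0 bash -c "mkdir -p $SCRATCH && cp -a /nfs/project/input $SCRATCH/" &&
    srun --het-group=1 ./process_data "$SCRATCH/input" &&
    srun --het-group=2 bash -c "cp -a $SCRATCH/results /nfs/project/ && rm -rf $SCRATCH"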
Disclaimer: I've never used this feature myself.
Prentice
On 4/3/21 5:31 PM, Fulcomer, Samuel wrote:
inline below...
On Sat, Apr 3, 2021 at 4:50 PM Will Dennis <wden...@nec-labs.com> wrote:
Sorry, obvs wasn’t ready to send that last message yet…
Our issue is that the shared storage is via NFS, and the “fast storage
in limited supply” is only local to each node. Hence the need to
copy the data over from NFS (and then remove it when finished with it.)
I also wanted the copy & remove to be different jobs, because the
main processing job usually requires GPU gres, which is a
time-limited resource on the partition. I don’t want to tie up the
allocation of GPUs while the data is staged (and removed), and if
the data copy fails, don’t want to even progress to the job where
the compute happens (so like, copy_data_locally && process_data)
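(One way to express that "copy && process && remove" chain as separate
jobs is an sbatch dependency; the script and partition names below are
only placeholders:

    # Submit the staging job on a CPU-only partition and capture its id.
    copy=$(sbatch --parsable --partition=cpu copy_data_locally.sh)
    # The GPU job starts only if the copy succeeded (afterok); if the
    # copy fails, the compute job is never released.
    proc=$(sbatch --parsable --dependency=afterok:$copy --partition=gpu --gres=gpu:1 process_data.sh)
    # Remove the staged copy once the compute job finishes, success or not.
    sbatch --dependency=afterany:$proc --partition=cpu remove_data.sh

For node-local scratch the later jobs also have to target the node that
holds the copy, which is the issue discussed below.)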
...yup... this is the problem. We've invested in GPFS and an NVMe
Excelero pool (for initial placement); however, we still have the
problem of having users pull down data from community repositories
before running useful computation.
Your question has gotten me thinking about this more. In our case, all
of our nodes are diskless (though we do have fast GPFS), so this
wouldn't really work for us. But if your fast storage is only local to
your nodes, the subsequent compute jobs will need to request those
specific nodes, so you'll need a mechanism to raise the SLURM
scheduling "weight" of a node after staging, so that the scheduler
prefers other, lower-weight nodes for unrelated work and leaves the
staged node free for the follow-up compute job. That could be done
in a job epilog.
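A rough sketch of that, assuming the epilog runs with enough privilege
to call scontrol and that the weight value and the way of recognizing
the staging job are site-specific:

    #!/bin/bash
    # Epilog fragment for the staging job: raise this node's scheduling
    # weight so the scheduler prefers other (lower-weight) nodes for
    # unrelated work, keeping the staged node free for the follow-up
    # compute job, which targets it explicitly (e.g. sbatch -w <node>).
    scontrol update NodeName=$SLURMD_NODENAME Weight=1000
    # The epilog of the cleanup job would set the weight back down.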
If you've got other fast storage in limited supply that can be
used for staged data, then by all means use it, but consider
whether you want batch CPU cores tied up for the wall time of
transferring the data. The staging could easily be done on a
time-shared frontend login node, from which users could then
submit (via script) the compute jobs after the data was staged.
Most of the transfer wallclock is spent in network wait, so don't
waste dedicated cores on it.
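For instance, a per-user wrapper run on the login node might look
something like this (paths, hostnames, and script names are invented):

    #!/bin/bash
    # Pull/stage the data first (mostly network wait, no batch cores
    # consumed), then submit the compute job only if staging succeeded.
    rsync -a user@repo.example.org:/data/input/ /fast/scratch/$USER/input/ \
      && sbatch --gres=gpu:1 process_data.sh /fast/scratch/$USER/input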