Hello,

maybe some additional notes:

While the cited procedure works well in general, it gets more complicated for heterogeneous setups, i.e. if you have several GPU types defined in gres.conf: the 'tres_per_<x>' fields can then take the form of either 'gres:gpu:N' or 'gres:gpu:<type>:N', depending on whether the job script specifies a GPU type or not. Of course, you could omit the GPU type definition in gres.conf and define the type as a node feature instead, as long as no node contains multiple different GPU types. Since the latter is the case in our cluster, I instead opted to check only for the presence of 'gpu' in the 'tres_per_<x>' fields and not to bother with parsing the actual number of GPUs. There is an interesting edge case here, however: users are free to set --gpus=0, so one either has to filter for that specifically or instruct one's users not to do that.
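
For illustration, a minimal sketch of such a presence check (the helper name and structure are mine, not our production script; the tres_per_* field names are the same ones used in Ward's script below):

local function requests_gpu(job_desc)
    -- Sketch: treat any 'gpu' substring in the tres_per_* fields as a GPU
    -- request, except for an explicit trailing count of 0 (e.g. --gpus=0).
    local tres_fields = { 'tres_per_node', 'tres_per_task',
                          'tres_per_socket', 'tres_per_job' }
    for _, field in ipairs(tres_fields) do
        local spec = job_desc[field]
        if spec ~= nil and string.find(spec, "gpu", 1, true) then
            -- works for both 'gres:gpu:N' and 'gres:gpu:<type>:N'
            local count = string.match(spec, ":([0-9]+)$")
            if count == nil or tonumber(count) > 0 then
                return true
            end
        end
    end
    return false
end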

Kind Regards,
René Sitt

On 29.03.23 at 08:57, Ward Poelmans wrote:

Hi,

We have dedicated partitions for GPUs (their names end with _gpu) and simply forbid a job that does not request GPU resources from using these partitions:

local function job_total_gpus(job_desc)
    -- return total number of GPUs allocated to the job
    -- There are many ways to request a GPU. This comes from the job_submit example in the slurm source.
    -- A GPU resource is either nil or "gres:gpu:N", with N the number of GPUs requested.

    -- pick relevant job resources for GPU spec (undefined resources can show limit values)
    local gpu_specs = {
        ['tres_per_node'] = 1,
        ['tres_per_task'] = 1,
        ['tres_per_socket'] = 1,
        ['tres_per_job'] = 1,
    }

    -- number of nodes
    if job_desc['min_nodes'] < 0xFFFFFFFE then gpu_specs['tres_per_node'] = job_desc['min_nodes'] end
    -- number of tasks
    if job_desc['num_tasks'] < 0xFFFFFFFE then gpu_specs['tres_per_task'] = job_desc['num_tasks'] end
    -- number of sockets
    if job_desc['sockets_per_node'] < 0xFFFE then gpu_specs['tres_per_socket'] = job_desc['sockets_per_node'] end
    gpu_specs['tres_per_socket'] = gpu_specs['tres_per_socket'] * gpu_specs['tres_per_node']

    local gpu_options = {}
    for tres_name, _ in pairs(gpu_specs) do
        local num_gpus = string.match(tostring(job_desc[tres_name]), "^gres:gpu:([0-9]+)") or 0
        gpu_options[tres_name] = tonumber(num_gpus)
    end
    -- calculate total GPUs
    for tres_name, job_res in pairs(gpu_specs) do
        local num_gpus = gpu_options[tres_name]
        if num_gpus > 0 then
            local total_gpus = num_gpus * tonumber(job_res)
            return total_gpus
        end
    end
    return 0
end



function slurm_job_submit(job_desc, part_list, submit_uid)
    local total_gpus = job_total_gpus(job_desc)
    slurm.log_debug("Job total number of GPUs: %s", tostring(total_gpus));

    if total_gpus == 0 then
        for partition in string.gmatch(tostring(job_desc.partition), '([^,]+)') do
            if string.match(partition, '_gpu$') then
                slurm.log_user(string.format('ERROR: GPU partition %s is not allowed for non-GPU jobs.', partition))
                return slurm.ESLURM_INVALID_GRES
            end
        end
    end

    return slurm.SUCCESS
end



Ward

On 29/03/2023 01:24, Frank Pari wrote:
Well, I wanted to avoid using Lua. But it looks like that's going to be the easiest way to do this without having to create a separate partition for the GPUs. Basically: check for at least one GPU in the job submission and, if none is requested, exclude all GPU nodes for the job.


Now I'm wondering how to auto-generate the list of nodes with GPUs, so I don't have to remember to update job_submit.lua every time we get new GPU nodes.
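
Maybe something like this at script load time (untested sketch, so take it with a grain of salt):

-- Untested sketch: build a set of GPU node names once, when job_submit.lua
-- is loaded, by parsing sinfo's GRES column ('%G') for each node ('%N').
-- Whether shelling out from slurmctld like this is acceptable is debatable.
local gpu_nodes = {}
local fh = io.popen("sinfo -h -N -o '%N %G' 2>/dev/null")
if fh then
    for line in fh:lines() do
        local node, gres = string.match(line, "^(%S+)%s+(%S+)")
        if node and gres and string.find(gres, "gpu", 1, true) then
            gpu_nodes[node] = true
        end
    end
    fh:close()
end

If that works, the table could then feed an exclude list for non-GPU jobs, e.g. via job_desc.exc_nodes, if I'm reading the job_desc fields right.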

-F

On Tue, Mar 28, 2023 at 4:06 PM Frank Pari <pa...@bc.edu> wrote:

    Hi all,

    First, thank you all for participating in this list.  I've learned so much just by following others' threads.  =)

    I'm looking at creating a scavenger partition with idle resources from CPU and GPU nodes and I'd like to keep this to one partition.  But, I don't want CPU only jobs using up resources on the GPU nodes.

    I've seen suggestions for job/lua scripts.  But, I'm wondering if there's any other way to ensure a job has requested at least 1 gpu for the scheduler to assign that job to a GPU node.

    Thanks in advance!

    -Frank


--
Dipl.-Chem. René Sitt
Hessisches Kompetenzzentrum für Hochleistungsrechnen
Philipps-Universität Marburg
Hans-Meerwein-Straße
35032 Marburg

Tel. +49 6421 28 23523
si...@hrz.uni-marburg.de
www.hkhlr.de
