Max,

Well, that was just an example. I was also doing the same thing with runs of 
around 125 nodes. Obviously, that's beyond a rack, and your chances of hitting 
a down node increase! I guess I figured that if I can under-specify a nodelist, 
maybe I could over-specify it too!

It's possible that using the topology bits could do it, but I'm also just a 
basic end-user with not too much knowledge of the system setup. This was the 
first time I even tried --nodelist (with help from the admins) so I was 
approaching it naively as you saw. 

Or, I suppose, is there a flag that one can pass to sbatch that gives the user 
a warning? That is: 

   Dear user, the allocation requested contains a node in a downed state. 
   This allocation will be PENDING for a while. You might want to rethink this.

I guess SLURM knows all the downed nodes, so maybe? (But then again, maybe 
sbatch would get orders of magnitude slower if it had to query a database of 
all nodes and run these checks...)

Matt

PS: Or I guess I could stare at Ole's cool SLURM tools and figure out a way to 
have my own "job checker". Find the downed nodes, parse 'scontrol show job 
1234', and display possible/impossible jobs :D
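
Something like this rough Python sketch is what I have in mind. It assumes 
that 'scontrol show job' prints a ReqNodeList= field and that 
'sinfo -h -N -t down,drain -o %N' lists the unavailable nodes one per line; 
the exact output may differ between SLURM versions, so treat that as an 
assumption. The function names and node names are made up for illustration, 
and the hostlist expander only handles simple cases ('scontrol show hostnames' 
is the real thing).

```python
import re
import subprocess


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'node[001-003,007],login1'.

    Simplified: one bracket group per name. 'scontrol show hostnames'
    is the authoritative expander.
    """
    hosts = []
    # Split on commas that are not inside a [...] group.
    for part in re.split(r",(?![^\[]*\])", expr):
        if not part:
            continue
        m = re.fullmatch(r"([^\[]*)\[([^\]]+)\](.*)", part)
        if m is None:
            hosts.append(part)
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            if "-" in item:
                lo, hi = item.split("-")
                width = len(lo)  # keep zero padding, e.g. 001
                for i in range(int(lo), int(hi) + 1):
                    hosts.append(f"{prefix}{i:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


def requested_nodes(job_text):
    """Pull the ReqNodeList field out of 'scontrol show job' output."""
    m = re.search(r"\bReqNodeList=(\S+)", job_text)
    if m is None or m.group(1) == "(null)":
        return []
    return expand_hostlist(m.group(1))


def blocked_nodes(job_text, down_nodes):
    """Return the requested nodes that are currently unavailable."""
    return sorted(set(requested_nodes(job_text)) & set(down_nodes))


def check_job(job_id):
    """Query SLURM and report which requested nodes are down or drained."""
    job_text = subprocess.run(
        ["scontrol", "show", "job", str(job_id)],
        capture_output=True, text=True, check=True,
    ).stdout
    sinfo_out = subprocess.run(
        ["sinfo", "-h", "-N", "-t", "down,drain", "-o", "%N"],
        capture_output=True, text=True, check=True,
    ).stdout
    return blocked_nodes(job_text, set(sinfo_out.split()))
```

Calling check_job(1234) on a login node would then return the list of 
requested nodes that are down or drained; an empty list means the nodelist 
itself isn't what is keeping the job PENDING.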

-- 
Matt Thompson, SSAI, Ld Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson

On 7/14/21, 1:42 PM, "slurm-users on behalf of Max Voit" 
<slurm-users-boun...@lists.schedmd.com on behalf of 
max.voit_m...@with-eyes.net> wrote:

    On Wed, 14 Jul 2021 17:04:45 +0000
    "Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]"
    <matthew.thomp...@nasa.gov> wrote:

    > Namely, I needed say, 20 nodes on a cluster on the same rack
    > ...
    > So, my question is, is there a way to say, "Please give me X nodes
    > inside this specific range of nodes?"

    Is the requirement actually the nodes being in the same rack, or rather
    being connected to the same switch? For the latter: If you specify a
    topology.conf file you can use --switches=... , cf.
    
    https://slurm.schedmd.com/topology.html

    Best,
    Max

