Hi Everyone,

We have a challenge with scheduling jobs in a partition composed of nodes that
are heterogeneous with respect to memory and cores [1]. We also use cores as
the unit of measure for charging users, and currently we implement a crude
mechanism of using MaxMemPerCPU as a proxy for memory use, to charge for
memory use. In the partition in question, we have nodes with 256GB, 384GB and
768GB of RAM. The 384GB and 256GB nodes have different core counts, but both
work out to roughly ~9GB/core; the 768GB nodes are roughly ~18GB/core. The
default memory request for the partition is set to that same ~9GB/core amount
and will remain unchanged. This partition is really for HTC, so the max node
limit is set to 2 and will remain there (the parallel partition is
homogeneous).
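
For reference, the partition currently looks roughly like this (the partition
name, node list and exact MB values below are illustrative, not our actual
config):

  # slurm.conf (illustrative) -- ~9GB/core default and cap, 2-node limit for HTC
  PartitionName=htc Nodes=node[001-100] Default=YES MaxNodes=2 DefMemPerCPU=9216 MaxMemPerCPU=9216 State=UP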

So, if we increase that MaxMemPerCPU number, we'll potentially have a lot of
nodes with un-schedulable cores (no memory left): at an 18GB/core cap, for
example, jobs requesting the full amount can exhaust a 256GB node's memory
with only ~14 of its cores allocated. If we leave it where it is, the extra
~384GB in the larger nodes won't ever get used. Of these two, the former is
preferable, even though the charge for memory is effectively halved (that's
fine, most allocations are monopoly money anyway). We really just want to
optimize job placement for throughput without having to create a separate
partition.

What we're concerned about is this: we don't believe the scheduler will be
smart about job placement, i.e. placing larger-memory jobs preferentially on
nodes with more total memory and smaller-memory jobs on the smaller-memory
nodes. To address this, we're thinking of simply weighting the smaller-memory
nodes so that jobs get placed there first (in slurm.conf terms, giving them a
lower Weight, since the nodes with the lowest Weight are allocated first),
and only get bumped to the larger-memory nodes when there are larger memory
requests and the smaller nodes are already full.
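
Concretely, something like the following is what we have in mind (node names,
core counts and RealMemory values are made up for illustration; the point is
just that the smallest-memory nodes carry the lowest Weight):

  # slurm.conf (illustrative) -- Slurm allocates the lowest-Weight nodes first
  NodeName=node[001-040] CPUs=28 RealMemory=256000 Weight=10   # 256GB nodes
  NodeName=node[041-080] CPUs=40 RealMemory=384000 Weight=20   # 384GB nodes
  NodeName=node[081-100] CPUs=40 RealMemory=768000 Weight=30   # 768GB nodes

With the default ~9GB/core request, jobs should pack onto the 256GB and 384GB
nodes first and only land on the 768GB nodes when they ask for more memory per
core or when the smaller nodes are already busy.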

We'd also like this scheme to limit backfill of small jobs onto the larger
nodes. Ideally, if we can get this to work, we'd extend it by getting rid of
the "largemem" (1-3TB nodes) partition and putting those nodes into this
single partition (many of our largemem users could easily fit individual jobs
in <768GB); see the sketch below. I have had good results on a small cluster
of very heterogeneous nodes all in one large partition, just letting the
scheduler handle things, and it worked reasonably well, with the exception of
some very large jobs (bordering on --exclusive, or explicitly --exclusive)
starving because of small-job backfill.
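
Extending the same pattern to the largemem nodes would presumably look like
this (again, node names, core counts and memory sizes are hypothetical),
though Weight only orders node selection; it doesn't by itself keep small jobs
from backfilling onto the big nodes when those are the only ones idle:

  # slurm.conf (illustrative) -- keep the 1-3TB nodes as the last resort
  NodeName=bigmem[01-04] CPUs=48 RealMemory=1536000 Weight=50   # 1.5TB nodes
  NodeName=bigmem[05-06] CPUs=48 RealMemory=3072000 Weight=60   # 3TB nodes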

Has anyone (everyone?) tried to deal with this? We're going to go ahead and try 
out this scheme (it seems pretty straightforward), but I wanted to get a sense 
of what other installations are doing.

Best,

Scott Ruffner
University of Virginia Research Computing

[1] Our cluster grows somewhat organically because our budget is
unpredictable, so we can't plan for forklift replacement of partitions (nodes)
on regular lifecycle periods.

--
Scott Ruffner
Senior HPC Engineer
UVa Research Computing
(434)924-6778(o)
(434)295-0250(h)
sruff...@virginia.edu
