Hi Alex,

Thanks. The issue is that we don't know where their jobs will end up running
in the heterogeneous environment, so a static core-count limit can't reflect
the performance of the hardware the jobs actually land on. In addition,
because the limit is applied as GrpTRES=cpu=N, someone buying 100 cores today
shouldn't get access to 130 of today's cores.
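For context, each condo cap today is just a raw core count attached to the
group's QoS, set with something along the lines of the following (the QoS
name here is made up for illustration):

    sacctmgr modify qos condo_smith set GrpTRES=cpu=100

Slurm only compares the group's allocated CPU count against that number; the
limit has no notion of which generation of node those CPUs land on, which is
why simply rescaling the number doesn't capture what was actually purchased.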
Regards,
Sam

On Wed, Jun 19, 2019 at 3:41 PM Alex Chekholko <a...@calicolabs.com> wrote:

> Hey Samuel,
>
> Can't you just adjust the existing "cpu" limit numbers using those same
> multipliers? Someone bought 100 CPUs 5 years ago, now that's ~70 CPUs.
>
> Or vice versa, someone buys 100 CPUs today, they get a setting of 130 CPUs
> because the CPUs are normalized to the old performance. It would probably
> look bad politically to reduce someone's number, but giving a new customer
> a larger number should be fine.
>
> Regards,
> Alex
>
> On Wed, Jun 19, 2019 at 12:32 PM Fulcomer, Samuel <
> samuel_fulco...@brown.edu> wrote:
>
>> (...and yes, the name is inspired by a certain OEM's software licensing
>> schemes...)
>>
>> At Brown we run a ~400 node cluster containing nodes of multiple
>> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade), purchased
>> in some cases with University funds and in others with investigator
>> funding (~50:50). They all appear in the default SLURM partition. We have
>> 3 classes of SLURM users:
>>
>> 1. Exploratory - no-charge access to up to 16 cores.
>> 2. Priority - $750/quarter for access to up to 192 cores (and with a
>>    GrpTRESRunMins=cpu limit). Each user has their own QoS.
>> 3. Condo - an investigator group that paid for nodes added to the
>>    cluster. The group has its own QoS and SLURM account. The QoS allows
>>    use of the number of cores purchased and has a much higher priority
>>    than the QoS' of the "priority" users.
>>
>> The first problem with this scheme is that condo users who purchased the
>> older hardware now have access to the newest without penalty. In
>> addition, we're encountering resistance to the idea of turning off their
>> hardware and terminating their condos (despite MOUs stating a 5-year
>> life). The pushback is the stated belief that the hardware should run
>> until it dies.
>>
>> What I propose is a new TRES called a Processor Performance Unit (PPU)
>> that would be specified on the Node line in slurm.conf, and used such
>> that GrpTRES=ppu=N was calculated as the number of allocated cores
>> multiplied by their associated PPU numbers.
>>
>> We could then assign a base PPU to the oldest hardware, say "1" for
>> Sandy/Ivy, and increase it for later architectures based on their
>> performance improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y,...,
>> where N is the number of cores of the oldest architecture and X is its
>> configured PPU per core, with additional terms for any newer nodes/cores
>> the investigator has purchased since.
>>
>> The result is that the investigator group gets to run on an approximation
>> of the performance that they've purchased, rather than on the raw
>> purchased core count.
>>
>> Thoughts?
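To make the proposal above a bit more concrete, here is a rough sketch of
what the configuration might look like. Note that "ppu" is not an existing
TRES, the Node-line syntax below is proposed rather than valid in current
Slurm, and the node names, PPU weights, and QoS name are purely illustrative:

    # slurm.conf (proposed syntax, not supported by Slurm today)
    NodeName=sandy[001-050]   CPUs=16 PPU=1.0   # Sandy/Ivy Bridge
    NodeName=cascade[001-020] CPUs=32 PPU=1.3   # Sky/Cascade Lake

    # Condo that bought 100 Sandy-era cores and 64 Cascade-era cores:
    #   GrpTRES ppu cap = 100*1.0 + 64*1.3 = 183.2, rounded to 183
    sacctmgr modify qos condo_smith set GrpTRES=ppu=183

Each allocated core would then be charged at its node's PPU weight against
that cap, so a group that bought 100 Sandy-era cores (ppu=100) could run on
roughly 77 Cascade-era cores at a time rather than the full 100.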