Yes, QoS's are dynamic.
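
As a rough illustration of what that could look like for a pending job, here is a minimal Python sketch that only builds an `scontrol update` command line. The job ID and the QOS name `gpu90` are made up for the example, and whether a regular user is allowed to switch QOS this way depends on how the site has set up its associations.

```python
import shlex


def qos_update_command(job_id, qos):
    """Build an `scontrol update` command that assigns a job to another QOS."""
    if job_id <= 0:
        raise ValueError("job_id must be a positive integer")
    return ["scontrol", "update", f"JobId={job_id}", f"QOS={qos}"]


if __name__ == "__main__":
    # 123456 and "gpu90" are hypothetical values, not taken from the thread.
    print(shlex.join(qos_update_command(123456, "gpu90")))
```

The sketch prints the command instead of running it, so it can be checked before anything touches the queue.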
-Paul Edmon-
On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote:
Hi Paul,
Thanks for your pointers.
I'll look into QOS and MCS after my paper deadline (Sept 5). Re
QOS, as expressed to Peter in the reply I just sent, I wonder whether
the QOS of a job can be changed while it's pending (submitted but
not yet running).
Regards,
Guillaume.
On Fri, Aug 30, 2019 at 10:24 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:
A QoS is probably your best bet. Another variant might be MCS, which
you can use to help reduce resource fragmentation. For limits, though,
QoS will be your best bet.
-Paul Edmon-
On 8/30/19 7:33 AM, Steven Dick wrote:
> It would still be possible to use job arrays in this situation; it's
> just slightly messy.
> So the way a job array works is that you submit a single script, and
> that script is provided an integer for each subjob. The integer is in
> a range, with a possible step (default=1).
>
> To run the situation you describe, you would have to predetermine how
> many of each test you want to run (i.e., you couldn't dynamically
> change the number of jobs that run within one array), and a master
> script would map the integer range to the job that was to be started.
>
> The most trivial way to do it would be to put the list of regressions
> in a text file and the master script would index it by line number and
> then run the appropriate command.
> A more complex way would be to do some math (a divide?) to get the
> script name and subindex (modulus?) for each regression.
>
> Both of these would require some semi-advanced scripting, but nothing
> that couldn't be cut and pasted with some trivial modifications for
> each job set.
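
The two mappings described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the script names `regress_0.sh` through `regress_8.sh` are hypothetical, the 9 x 40 layout is borrowed from the example later in the thread, and the array is assumed to be submitted with 0-based indices. `SLURM_ARRAY_TASK_ID` is the environment variable Slurm sets for each array task.

```python
import os

RUNS_PER_SCRIPT = 40  # from the thread's example: 9 scripts, 40 runs each
SCRIPTS = [f"regress_{n}.sh" for n in range(9)]  # hypothetical script names


def command_from_list(commands, task_id):
    """Simple approach: treat the task ID as a 0-based line number."""
    return commands[task_id]


def script_and_run(task_id, scripts=SCRIPTS, runs_per_script=RUNS_PER_SCRIPT):
    """Arithmetic approach: the quotient picks the script, the remainder the run."""
    script_index, run_index = divmod(task_id, runs_per_script)
    return scripts[script_index], run_index


if __name__ == "__main__":
    # Fall back to task 0 when run outside of a Slurm array job.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    script, run = script_and_run(task_id)
    print(f"task {task_id}: would run {script} (run {run})")
```

For the text-file variant, `commands` would be the lines read from the file; the master script would then execute the selected entry rather than print it.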
>
> As to the unavailability of the admin ...
> An alternate approach that would require the admin's help would be to
> come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100
> gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,
> maxtrespu=gpu=40). Then the user would assign that QOS to the job when
> starting it to set the overall allocation for all the jobs. The admin
> wouldn't need to tweak this except once; you just pick which tweak to
> use.
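
A minimal sketch of that one-time setup, written as Python that only generates the command lines. The QOS names (`gpu40`, `gpu80`, ...) are invented for the example, and the limit is spelled `MaxTRESPerUser=gres/gpu=N` on the assumption that GPUs are tracked under the `gres/gpu` TRES on the cluster; the thread's shorthand was `maxtrespu=gpu=40`. The admin would also have to grant users access to these QOS entries, which is not shown.

```python
def qos_setup_commands(gpu_limits):
    """Return sacctmgr command lines creating one GPU-limited QOS per tier."""
    commands = []
    for limit in gpu_limits:
        name = f"gpu{limit}"  # hypothetical naming scheme
        commands.append(["sacctmgr", "add", "qos", name])
        commands.append(
            ["sacctmgr", "modify", "qos", name, "set",
             f"MaxTRESPerUser=gres/gpu={limit}"]
        )
    return commands


def submit_command(script, gpu_limit):
    """Return the sbatch command a user would run to pick one of the tiers."""
    return ["sbatch", f"--qos=gpu{gpu_limit}", script]


if __name__ == "__main__":
    for cmd in qos_setup_commands([40, 80, 100]):
        print(" ".join(cmd))
    print(" ".join(submit_command("regress_0.sh", 80)))
```

Once the tiers exist, choosing the day's allocation is just a matter of which `--qos` value the user passes at submission time.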
>
> On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
> <gperr...@uottawa.ca> wrote:
>> Hi Steven,
>>
>> Thanks for taking the time to reply to my post.
>>
>> Setting a limit on the number of jobs for a single array isn't
>> sufficient because regression-tests need to launch multiple
>> arrays, and I would need a job limit that would take effect over
>> all launched jobs.
>>
>> It's very possible I'm not understanding something. I'll lay out a
>> very specific example in the hopes you can correct me if I've gone
>> wrong somewhere.
>>
>> Let's take the small cluster with 140 GPUs and no fairshare as
>> an example, because it's easier for me to explain.
>>
>> The users, who all know each other personally and interact via
>> chat, decide on a daily basis how many jobs each user can run at a
>> time.
>>
>> Let's say today is Sunday (hypothetically). Nobody is actively
>> developing today, except that user 1 has 10 jobs running for the
>> entire weekend. That leaves 130 GPUs unused.
>>
>> User 2, whose jobs all run on 1 GPU, decides to run a regression
>> test. The regression test comprises 9 different scripts, each
>> run 40 times, for a grand total of 360 jobs. The scripts take
>> between 1 and 5 hours to complete, and the jobs take on
>> average 4 hours to complete.
>>
>> User 2 gets the user group's approval (via chat) to use 90 GPUs
>> (so that 40 GPUs will remain for anyone else wanting to work that
>> day).
>>
>> The problem I'm trying to solve is this: how do I ensure that
>> user 2 launches his 360 jobs in such a way that 90 jobs are in the
>> run state consistently until the regression test is finished?
>>
>> Keep in mind that:
>>
>> - limiting each job array to 10 jobs is inefficient: when the
>>   first job array finishes (long before the last one), only 80 GPUs
>>   will be used, and so on as other arrays finish
>> - the admin is not available, he cannot be asked to set a hard
>>   limit of 90 jobs for user 2 just for today
>>
>> I would be happy to use job arrays if they allow me to set an
>> overarching job limit across multiple arrays. Perhaps this is
>> doable. Admittedly I'm working on a paper to be submitted in a few
>> days, so I don't have time to test job arrays thoroughly, but I
>> will try them out more thoroughly once I've submitted my
>> paper (i.e., after Sept 5).
>>
>> My solution, for now, is to not use job arrays. Instead, I
>> launch each job individually, and I use singleton (by launching
>> all jobs with the same 90 unique names) to ensure that exactly 90
>> jobs are run at a time (in this case, corresponding to 90 GPUs in
>> use).
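
The singleton workaround described above might look roughly like the following Python sketch, which only builds the `sbatch` command lines. The slot names (`slot00` ... `slot89`) and the round-robin assignment of jobs to names are assumptions made for illustration; `--dependency=singleton` lets a job start only after earlier jobs with the same name and user have finished, so at most one job per name runs at a time.

```python
from itertools import cycle


def singleton_submissions(scripts, slots):
    """Pair each script with one of `slots` job names, cycling through them."""
    names = [f"slot{n:02d}" for n in range(slots)]  # hypothetical job names
    return [
        ["sbatch", "--dependency=singleton", f"--job-name={name}", script]
        for script, name in zip(scripts, cycle(names))
    ]


if __name__ == "__main__":
    # 9 hypothetical scripts, each submitted 40 times: 360 jobs over 90 names.
    jobs = [f"regress_{n}.sh" for n in range(9) for _ in range(40)]
    for cmd in singleton_submissions(jobs, 90)[:3]:
        print(" ".join(cmd))
```

With 360 jobs and 90 names, each name is reused four times, which caps the number of concurrently running jobs at 90.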
>>
>> Side note: the unavailability of the admin might sound
>> contrived by picking Sunday as an example, but it's in fact very
>> typical. The admin is not available:
>>
>> - on weekends (the present example)
>> - at any time outside of 9am to 5pm (keep in mind, this is a
>>   cluster used by students in different time zones)
>> - any time he is on vacation
>> - any time he is looking after his many other
>>   responsibilities. Constantly setting user limits that change on a
>>   daily basis would be too much to ask.
>>
>>
>> I'd be happy if you corrected my misunderstandings, especially
>> if you could show me how to set a job limit that takes effect over
>> multiple job arrays.
>>
>> I may have very glaring oversights as I don't necessarily have
>> a big picture view of things (I've never been an admin, most
>> notably), so feel free to poke holes in the way I've constructed
>> things.
>>
>> Regards,
>> Guillaume.
>>
>>
>> On Fri, Aug 30, 2019 at 1:22 AM Steven Dick <kg4...@gmail.com> wrote:
>>> This makes no sense and seems backwards to me.
>>>
>>> When you submit an array job, you can specify how many jobs from the
>>> array you want to run at once.
>>> So, an administrator can create a QOS that explicitly limits the user.
>>> However, you keep saying that they probably won't modify the system
>>> for just you...
>>>
>>> That seems to me to be the perfect case to use array jobs and tell it
>>> how many elements of the array to run at once.
>>> You're not using array jobs for exactly the wrong reason.
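
For reference, the per-array limit mentioned here is written with a `%` suffix on the array specification. The small Python helper below just formats that option; the helper name and the 40-task, 10-at-a-time numbers are illustrative only. Note that the limit applies to one array, not across several arrays.

```python
def array_spec(num_tasks, max_running):
    """Format an sbatch --array option for tasks 0..num_tasks-1,
    with at most `max_running` of them running simultaneously."""
    if num_tasks < 1 or max_running < 1:
        raise ValueError("num_tasks and max_running must be positive")
    return f"--array=0-{num_tasks - 1}%{max_running}"


if __name__ == "__main__":
    # Hypothetical example: 40 tasks, no more than 10 running at once.
    print("sbatch", array_spec(40, 10), "regress_0.sh")
```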
>>>
>>> On Tue, Aug 27, 2019 at 1:19 PM Guillaume Perrault Archambault
>>> <gperr...@uottawa.ca> wrote:
>>>> The reason I don't use job arrays is to be able to limit the
>>>> number of jobs per user