Setting the slurm.conf parameter EnforcePartLimits to ANY (or NO) may help with this, though I'm not sure.
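If you want to try it, the change would look something like the sketch below. My understanding (worth verifying) is that with ALL a multi-partition submission is rejected at submit time unless every listed partition would accept it, ANY only requires that at least one of them would, and NO skips the submit-time check entirely. Whether that also covers the AllowAccounts restriction in your case is exactly the part I'm not sure about.

    # slurm.conf (sketch, untested for this particular case)
    EnforcePartLimits=ANY

The submission being discussed would then stay the usual multi-partition request, e.g. (a1 and a2 are placeholder partition names, b4 is the restricted one from this thread):

    sbatch --partition=a1,a2,b4 job.sh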
Best,
Feng

On Thu, Sep 21, 2023 at 11:27 AM Jason Simms <jsim...@swarthmore.edu> wrote:
>
> I personally don't think that we should assume users will always know which partitions are available to them. Ideally, of course, they would, but I think it's fine to assume users should be able to submit a list of partitions that they would be fine running their jobs on, and if one is forbidden for whatever reason, Slurm just selects another one of the choices. I'd expect similar behavior if a particular partition were down or had been removed; as long as there is an acceptable specified partition available, run it there, and don't kill the job. Seems really reasonable to me.
>
> Jason
>
> On Thu, Sep 21, 2023 at 10:40 AM David <dr...@umich.edu> wrote:
>>
>> That's not at all how I interpreted this man page description. By "If the job can use more than..." I thought it was completely obvious (although perhaps wrong, if your interpretation is correct, but it never crossed my mind) that it referred to whether the _submitting user_ is OK with it using more than one partition. The partition where the user is forbidden (because of the partition's allowed account) should just be _not_ the earliest initiation (because it'll never initiate there), and therefore not run there, but still be able to run on the other partitions listed in the batch script.
>>
>> > that's fair. I was considering this only given the fact that we know the user doesn't have access to a partition (this isn't the surprise here) and that slurm communicates that as the reason pretty clearly. I can see how if a user is submitting against multiple partitions they might hope that if a job couldn't run in a given partition, given the number of others provided, the scheduler might consider all of those *before* dying outright at the first rejection.
>>
>> On Thu, Sep 21, 2023 at 10:28 AM Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) <noam.bernst...@nrl.navy.mil> wrote:
>>>
>>> On Sep 21, 2023, at 9:46 AM, David <dr...@umich.edu> wrote:
>>>
>>> Slurm is working as it should. From your own examples you proved that; by not submitting to b4 the job works. However, looking at man sbatch:
>>>
>>> -p, --partition=<partition_names>
>>>        Request a specific partition for the resource allocation. If not specified, the default behavior is to allow the slurm controller to select the default partition as designated by the system administrator. If the job can use more than one partition, specify their names in a comma separate list and the one offering earliest initiation will be used with no regard given to the partition name ordering (although higher priority partitions will be considered first). When the job is initiated, the name of the partition used will be placed first in the job record partition string.
>>>
>>> In your example, the job can NOT use more than one partition (given the restrictions defined on the partition itself precluding certain accounts from using it). This, to me, seems either like a user education issue (i.e. don't have them submit to every partition), or you can try the job submit lua route - or perhaps the hidden partition route (which I've not tested).
>>>
>>>
>>> That's not at all how I interpreted this man page description. By "If the job can use more than..." I thought it was completely obvious (although perhaps wrong, if your interpretation is correct, but it never crossed my mind) that it referred to whether the _submitting user_ is OK with it using more than one partition. The partition where the user is forbidden (because of the partition's allowed account) should just be _not_ the earliest initiation (because it'll never initiate there), and therefore not run there, but still be able to run on the other partitions listed in the batch script.
>>>
>>> I think it's completely counter-intuitive that submitting saying it's OK to run on one of a few partitions, and one partition happening to be forbidden to the submitting user, means that it won't run at all. What if you list multiple partitions, and increase the number of nodes so that there aren't enough in one of the partitions, but not realize this problem? Would you expect that to prevent the job from ever running on any partition?
>>>
>>> Noam
>>
>>
>> --
>> David Rhey
>> ---------------
>> Advanced Research Computing
>> University of Michigan
>
>
> --
> Jason L. Simms, Ph.D., M.P.H.
> Manager of Research Computing
> Swarthmore College
> Information Technology Services
> (610) 328-8102
> Schedule a meeting: https://calendly.com/jlsimms
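PS: If EnforcePartLimits doesn't cover the AllowAccounts case, the "job submit lua route" David mentions could instead drop the forbidden partition from the request so the rest of the list can still be considered. A rough, untested sketch is below; the partition name b4 is from this thread, "allowed_acct" is a placeholder, and the field names are from the job_submit plugin API as I recall it, so please double-check them (in practice you'd also look the restriction up rather than hard-code it):

    -- job_submit.lua (sketch, untested)
    -- Strip partition "b4" from a multi-partition request when the submitting
    -- account isn't one that b4 allows, so the remaining partitions can still run.
    function slurm_job_submit(job_desc, part_list, submit_uid)
       if job_desc.partition ~= nil and job_desc.account ~= "allowed_acct" then
          local kept = {}
          for p in string.gmatch(job_desc.partition, "[^,]+") do
             if p ~= "b4" then
                table.insert(kept, p)
             end
          end
          if #kept > 0 then
             job_desc.partition = table.concat(kept, ",")
          end
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end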