Setting the slurm.conf parameter EnforcePartLimits=ANY (or NO) may help with this, but I'm not sure.
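
A minimal slurm.conf sketch (per the slurm.conf man page, ANY accepts a job if at least one requested partition's limits allow it, while NO accepts it regardless and leaves it queued):

    # slurm.conf -- apply with "scontrol reconfigure" or a slurmctld restart
    EnforcePartLimits=ANY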

Best,

Feng


On Thu, Sep 21, 2023 at 11:27 AM Jason Simms <jsim...@swarthmore.edu> wrote:
>
> I personally don't think we should assume users will always know which
> partitions are available to them. Ideally, of course, they would, but
> users should be able to submit a list of partitions they would be fine
> running their jobs on, and if one is forbidden for whatever reason, Slurm
> should just select another one of the choices. I'd expect similar behavior
> if a particular partition were down or had been removed: as long as there
> is an acceptable specified partition available, run the job there rather
> than killing it. Seems really reasonable to me.
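>
> For example (partition names hypothetical):
>
>     sbatch --partition=standard,gpu,restricted job.sh
>
> If the submitting account is barred from "restricted", the job should
> still be able to land on "standard" or "gpu".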
>
> Jason
>
> On Thu, Sep 21, 2023 at 10:40 AM David <dr...@umich.edu> wrote:
>>
>> > That's not at all how I interpreted this man page description.  By "If the
>> > job can use more than..." I thought it was completely obvious (although
>> > perhaps wrong, if your interpretation is correct, but it never crossed my
>> > mind) that it referred to whether the _submitting user_ is OK with it using
>> > more than one partition. The partition where the user is forbidden (because
>> > of the partition's allowed account) should just be _not_ the earliest
>> > initiation (because it'll never initiate there), and therefore not run
>> > there, but still be able to run on the other partitions listed in the batch
>> > script.
>>
>> That's fair. I was considering this only given the fact that we know the
>> user doesn't have access to a partition (this isn't the surprise here) and
>> that Slurm communicates that as the reason pretty clearly. I can see how,
>> if a user is submitting against multiple partitions, they might hope that
>> if a job couldn't run in a given partition, the scheduler would consider
>> all of the others provided *before* dying outright at the first rejection.
>>
>> On Thu, Sep 21, 2023 at 10:28 AM Bernstein, Noam CIV USN NRL (6393) 
>> Washington DC (USA) <noam.bernst...@nrl.navy.mil> wrote:
>>>
>>> On Sep 21, 2023, at 9:46 AM, David <dr...@umich.edu> wrote:
>>>
>>> Slurm is working as designed; your own examples show that, since the job
>>> runs when you don't submit to b4. However, looking at man sbatch:
>>>
>>>        -p, --partition=<partition_names>
>>>               Request a specific partition for the resource allocation. If
>>>               not specified, the default behavior is to allow the slurm
>>>               controller to select the default partition as designated by
>>>               the system administrator. If the job can use more than one
>>>               partition, specify their names in a comma separate list and
>>>               the one offering earliest initiation will be used with no
>>>               regard given to the partition name ordering (although higher
>>>               priority partitions will be considered first). When the job
>>>               is initiated, the name of the partition used will be placed
>>>               first in the job record partition string.
>>>
>>> In your example, the job can NOT use more than one partition (given the
>>> restrictions defined on the partition itself precluding certain accounts
>>> from using it). This, to me, seems like either a user education issue
>>> (i.e., don't have them submit to every partition), or you can try the job
>>> submit Lua route - or perhaps the hidden partition route (which I've not
>>> tested).
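>>>
>>> If you go the Lua route, here's a minimal job_submit.lua sketch, assuming
>>> a hard-coded list of restricted partitions (the real check against the
>>> partition's AllowAccounts is left out, so treat the filter as
>>> illustrative):
>>>
>>>     -- job_submit.lua: drop partitions the submitter shouldn't use, so the
>>>     -- job is only validated against the remaining requested partitions.
>>>     local restricted = { b4 = true }  -- hypothetical restricted partition
>>>
>>>     function slurm_job_submit(job_desc, part_list, submit_uid)
>>>         if job_desc.partition == nil then
>>>             return slurm.SUCCESS
>>>         end
>>>         local kept = {}
>>>         for part in string.gmatch(job_desc.partition, "[^,]+") do
>>>             if not restricted[part] then
>>>                 table.insert(kept, part)
>>>             end
>>>         end
>>>         -- only rewrite if something survives; otherwise leave the job
>>>         -- to fail with Slurm's normal error
>>>         if #kept > 0 then
>>>             job_desc.partition = table.concat(kept, ",")
>>>         end
>>>         return slurm.SUCCESS
>>>     end
>>>
>>>     function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
>>>         return slurm.SUCCESS
>>>     end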
>>>
>>>
>>> That's not at all how I interpreted this man page description.  By "If the 
>>> job can use more than..." I thought it was completely obvious (although 
>>> perhaps wrong, if your interpretation is correct, but it never crossed my 
>>> mind) that it referred to whether the _submitting user_ is OK with it using 
>>> more than one partition. The partition where the user is forbidden (because 
>>> of the partition's allowed account) should just be _not_ the earliest 
>>> initiation (because it'll never initiate there), and therefore not run 
>>> there, but still be able to run on the other partitions listed in the batch 
>>> script.
>>>
>>> I think it's completely counter-intuitive that submitting a job saying
>>> it's OK to run on one of a few partitions, where one partition happens to
>>> be forbidden to the submitting user, means that it won't run at all.  What
>>> if you list multiple partitions and increase the number of nodes so that
>>> there aren't enough in one of the partitions, without realizing the
>>> problem?  Would you expect that to prevent the job from ever running on
>>> any partition?
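>>>
>>> Concretely (hypothetical sizes): something like
>>>
>>>     sbatch --nodes=64 --partition=small,big job.sh
>>>
>>> where "small" has only 32 nodes. With EnforcePartLimits=ALL, that job is
>>> rejected at submission even though "big" alone could run it.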
>>>
>>> Noam
>>
>>
>>
>> --
>> David Rhey
>> ---------------
>> Advanced Research Computing
>> University of Michigan
>
>
>
> --
> Jason L. Simms, Ph.D., M.P.H.
> Manager of Research Computing
> Swarthmore College
> Information Technology Services
> (610) 328-8102
> Schedule a meeting: https://calendly.com/jlsimms
