> Fewer. ;)
True. What was I thinking?
> sometimes even the person who set the reservation doesn’t figure it out.
Like me/us? ;)
Prentice
On 05/07/2018 05:42 PM, Ryan Novosielski wrote:
Fewer. ;)
I think rumor had it that there were plans for some improvement in this area
(you might check the bugs or this mailing list; I can’t remember where I saw
it, but it was a while back now), because ReqNodeNotAvail almost never means
something useful, and reservations don’t actually generate any message
whatsoever that would indicate that they are there. Almost 100% of the time we
see questions about this at our site, it’s a reservation doing it, and
sometimes even the person who set the reservation doesn’t figure it out.
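For example, all a user (or an admin) ever sees is the generic reason in the
squeue output, something like this, where the username and format string are
just for illustration:

    # %r prints the pending reason; a reservation never shows up in it
    squeue -u someuser -o "%.10i %.9P %.8T %r"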
On May 7, 2018, at 5:32 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
Dang it. That's it. I recently changed the default time limit on some of my
partitions to only 48 hours. I have a reservation that starts on Friday at 5
PM. These jobs are all assigned to partitions that still have longer time
limits, so the jobs would still be running when the reservation starts. I
forgot that not all partitions have the new 48-hour limit.
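In hindsight, the mismatch is easy to spot by putting the partition time limits
next to the reservation window, e.g.:

    # Per-partition default and maximum run times
    scontrol show partition | grep -E 'PartitionName|DefaultTime|MaxTime'
    # Start time and node list of the upcoming reservation
    scontrol show reservation

Any pending job whose time limit runs past the reservation's StartTime can't be
started on the reserved nodes, hence the ReqNodeNotAvail.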
Still, Slurm should provide a better error message for that situation, since
I'm sure it's not that uncommon for this to happen. It would certainly result
in a lot less tickets being sent to me.
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 05/07/2018 05:11 PM, Ryan Novosielski wrote:
In my experience, it may say that even when it has nothing to do with why the
job isn’t running, as long as there are nodes on the system that aren’t available.
I assume you’ve checked for reservations?
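Either of these, for example, will list any active or upcoming reservations and
the nodes they cover:

    scontrol show reservation
    sinfo -T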
On May 7, 2018, at 5:06 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
Dear Slurm Users,
On my cluster, I have several partitions, each with its own QOS, time limits,
etc.
Several times today, I've received complaints from users that they submitted jobs to a
partition with available nodes, but jobs are stuck in the PD state. I have spent the
majority of my day investigating this, but haven't turned up anything meaningful. Both
jobs show the "ReqNodeNotAvail" reason, but none of the nodes listed as not
available are even in the partition these jobs are submitted to. Neither job has
requested a specific node, either.
I have checked slurmctld.log on the server, and have not been able to find any
clues. Anywhere else I should look? Any ideas what could be causing this?
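For reference, the sort of per-job detail I've been looking at, with 12345
standing in for one of the stuck job IDs:

    # Full job record: Reason, Partition, ReqNodeList, TimeLimit, etc.
    scontrol show job 12345
    # Slurm's estimate of when the pending job will start
    squeue -j 12345 --start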
--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'