Steve,

This doesn't really address your question, and I am guessing you are aware of this; however, since you did not mention it: "scontrol show job <jobid>" will give you a lot of detail about a job (a lot more than squeue). Its "Reason" is the same as in sinfo and squeue, though, so no help there.

I've always found that it is a bit of a detective exercise. In the end, though, there's always a reason; it's just sometimes very subtle. For example, we use "Features" so that users can constrain their jobs based on various factors (e.g., CPU architecture), and we'll sometimes have users ask for something like a "Haswell" processor and 190 GB of memory ... but we only have that much memory on our Skylake machines. So the "reason" can be very non-linear: each request looks fine on its own, but no single node satisfies the combination.
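To make the Haswell example concrete, here is a minimal sketch of the check I end up doing by hand. The node names, feature tags, and memory sizes are made up for illustration and do not describe any real cluster; in practice the inventory would come from something like "scontrol show node". The `Node` class and `diagnose` function are hypothetical helpers, not part of SLURM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    features: frozenset   # feature tags set by the admin, e.g. {"haswell"}
    real_memory_mb: int   # total memory configured on the node


def diagnose(nodes, feature, mem_mb):
    """Report which nodes meet each request alone, and which meet both."""
    by_feature = [n.name for n in nodes if feature in n.features]
    by_memory = [n.name for n in nodes if n.real_memory_mb >= mem_mb]
    both = [
        n.name
        for n in nodes
        if feature in n.features and n.real_memory_mb >= mem_mb
    ]
    return {"feature": by_feature, "memory": by_memory, "both": both}


# Hypothetical inventory: Haswell nodes with less memory, one large Skylake node.
NODES = [
    Node("hsw01", frozenset({"haswell"}), 128_000),
    Node("hsw02", frozenset({"haswell"}), 128_000),
    Node("skl01", frozenset({"skylake"}), 192_000),
]

# A job asking for a Haswell CPU and 190 GB of memory.
result = diagnose(NODES, "haswell", 190_000)
for requirement, names in result.items():
    print(f"{requirement}: {names if names else 'no matching node'}")
```

If "feature" and "memory" both list nodes but "both" is empty, the job can never start as submitted, no matter how idle the cluster looks.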
Sadly, I don't know of an easy tool that just looks at all the data and tells you the cause, or at least gives you better clues. I agree that that would be very helpful.

As to preemptable, do you have "checkpoint" enabled via SLURM? There are situations in which a SLURM-checkpointed job will still occupy some memory, and a pending job cannot start because that memory is in use, even though the checkpointed job was suspended. Perhaps someone on the list with more experience using the preemptable partitions/QoS *WITH* the SLURM checkpointing flag enabled could speak to this? As Steve knows, we just cancel the job when it is preempted.

Paul.

On Mon, Nov 26, 2018 at 3:22 AM Daan van Rossum <d.r.vanros...@gmx.de> wrote:
>
> I'm also interested in this. Another example: "Reason=(ReqNodeNotAvail)" is
> all that a user sees in a situation when his/her job's walltime runs into a
> system maintenance reservation.
>
> * on Friday, 2018-11-23 09:55 -0500, Steven Dick <kg4...@gmail.com> wrote:
>
> > I'm looking for a tool that will tell me why a specific job in the
> > queue is still waiting to run. squeue doesn't give enough detail. If
> > the job is held up on QOS, it's pretty obvious. But if it's
> > resources, it's difficult to tell.
> >
> > If a job is not running because of resources, how can I identify which
> > resource is not available? In a few cases, I've looked at what the
> > job asked for and found a node that has those resources free, but
> > still can't figure out why it isn't running.
> >
> > Also, if there are preemptable jobs in the queue, why is the job
> > waiting on resources? Is there a priority for running jobs that can
> > be compared to waiting jobs?