On 02.02.22 18:32, Michael Di Domenico wrote:
On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote:
The problem is to identify the cards physically from the information we
have, like what's reported with nvidia-smi or available in
/proc/driver/nvidia/gpus/*/information
The serial number isn't shown
Hi,
My Slurm version is 20.11.5.
I use job_container/tmpfs to set up a private /tmp, but the permission is
700. Normal users cannot read or write it.
drwx------ 2 root root 6 Jan 31 01:32 tmp
I think the permission should be 70
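For anyone hitting the same thing, a minimal sketch of the configuration the
job_container/tmpfs plugin expects (the BasePath below is a placeholder, not
taken from the original message, and must already exist on each compute node):

    # slurm.conf
    JobContainerType=job_container/tmpfs
    PrologFlags=Contain

    # job_container.conf (next to slurm.conf)
    # BasePath is where the per-job private /tmp directories are created
    BasePath=/var/spool/slurm/containers

With PrologFlags=Contain set, the plugin mounts a per-job directory over /tmp
and /dev/shm inside the job's mount namespace.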
I actually just did that path for a children's hospital.
It was fairly straightforward. Running jobs were not affected.
You do need to go 17 -> 18 -> 19 -> 20 -> 21,
because there were changes in the DB schema between releases.
If you plan on bringing everything to a stop (no running jobs), you
should be good.
Are you running slurmdbd in your current setup? If you are, then the upgrade
path might have additional considerations when moving this far across versions.
--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112
Phone: 801-558
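For context, a rough sketch of what one hop of that 17 -> 18 -> 19 -> 20 -> 21
chain looks like when slurmdbd is involved (package and database names are
assumptions; the order, slurmdbd first, then slurmctld, then slurmd, follows
the Slurm upgrade documentation):

    # repeat once per major-release hop
    systemctl stop slurmdbd
    # back up accounting data first (default DB name assumed)
    mysqldump slurm_acct_db > slurm_acct_db.$(date +%F).sql
    # install the next major release's packages (site-specific step)
    systemctl start slurmdbd      # slurmdbd converts the DB schema on first start
    systemctl restart slurmctld
    # then upgrade and restart slurmd on the compute nodes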
The "Upgrades" section of the quick-start guide [0] warns:
> Slurm permits upgrades to a new major release from the past two major
> releases, which happen every nine months (e.g. 20.02.x or 20.11.x to
> 21.08.x) without loss of jobs or other state information. State
> information from older versions
Does anyone have a working example using PreemptExemptTime?
My goal is to make a higher priority job wait 24 hours before actually
preempting a lower priority job. Put another way, any job is entitled to 24
hours of run time before being preempted. The preempted job should be
suspended, ideally. If r
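Not a tested recipe from the thread, but a minimal slurm.conf sketch of the
setup being described (QOS-based preemption and gang scheduling are
assumptions about the rest of the configuration):

    PreemptType=preempt/qos
    PreemptMode=SUSPEND,GANG
    # no job is preempted during its first 24 hours of run time
    PreemptExemptTime=24:00:00

If different job classes need different grace periods, PreemptExemptTime can
also be set per QOS, e.g. with something like
sacctmgr modify qos normal set PreemptExemptTime=24:00:00.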
Hello, thank you for your suggestion, and thanks also to Tina.
To answer your question, there is no TreeWidth entry in the slurm.conf.
But it seems we figured out the issue, and I'm sorry we did not
think about it earlier: we already had a pool of 48 nodes on the master but
their slurm.conf
On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote:
> The problem is to identify the cards physically from the information we
> have, like what's reported with nvidia-smi or available in
> /proc/driver/nvidia/gpus/*/information
> The serial number isn't shown for every type of GPU and I'm not sure
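One way to collect the identifiers being discussed (standard nvidia-smi query
fields; as noted, the serial field is not populated on every GPU model):

    # index, PCI bus ID, board serial and UUID for every GPU on the node
    nvidia-smi --query-gpu=index,name,pci.bus_id,serial,uuid --format=csv

The PCI bus ID can then be matched against the directories under
/proc/driver/nvidia/gpus/ to cross-check the two sources.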
Hi Jeremy,
What is the value of TreeWidth in your slurm.conf? If there is no entry
then I recommend setting it to a value a bit larger than the number of
nodes you have in your cluster and then restarting slurmctld.
Best,
Steve
On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix wrote:
> Hi,
>
> A fol
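As a concrete illustration of that suggestion (the numbers are assumptions for
a cluster of roughly 300 nodes, not values from the thread):

    # slurm.conf: a value a bit above the node count makes slurmctld
    # talk to every slurmd directly instead of through a fanout tree
    TreeWidth=350

    # TreeWidth changes need a restart of the controller
    systemctl restart slurmctld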
Hi Jeremy,
I haven't got anything very intelligent to contribute to solve your problem.
However, what I can tell you is that we run our production cluster with
one SLURM master running on a virtual machine handling just over 300
nodes. We have never seen the sort of problem you have other than