Yes, all the nodes with those GRES types have a gres.conf with those
names/counts set...
-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Chris Samuel
Sent: Thursday, March 21, 2019 10:47 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [s
On 21/3/19 7:39 pm, Will Dennis wrote:
Why does it think that the "gres/gpu_mem_per_card" count is 0? How can I fix
this?
Did you remember to distribute gres.conf as well to the nodes?
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
I tried doing this as follows:
Node's gres.conf:
##
# Slurm's Generic Resource (GRES) configuration file
##
Name=gpu File=/dev/nvidia0 Type=1050TI
Name=gpu_mem_per_card C
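For reference, a hedged sketch of what a complete entry for a counted, file-less GRES might look like; the Count value and the matching slurm.conf Gres= line are illustrative assumptions, not taken from the original message:

##
# gres.conf sketch -- values are illustrative, not from the thread
##
Name=gpu File=/dev/nvidia0 Type=1050TI
# A GRES with no device file needs an explicit Count, otherwise the node
# reports 0 for it (4G here is a placeholder for the card's memory)
Name=gpu_mem_per_card Count=4G

# The node definition in slurm.conf has to advertise the same counts,
# e.g. (illustrative): Gres=gpu:1050TI:1,gpu_mem_per_card:4G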
On 21.03.2019 at 23:43, Prentice Bisbal wrote:
> Slurm-users,
>
> My users here have developed a GUI application which serves as a GUI
> interface to various physics codes they use. From this GUI, they can submit
> jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 18.08.6-2, and
Slurm-users,
My users here have developed a GUI application which serves as a GUI
interface to various physics codes they use. From this GUI, they can
submit jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to
18.08.6-2, and a user has reported a problem when submitting Slurm jobs
t
On 3/21/19 4:40 PM, Reuti wrote:
On 21.03.2019 at 16:26, Prentice Bisbal wrote:
On 3/20/19 1:58 PM, Christopher Samuel wrote:
On 3/20/19 4:20 AM, Frava wrote:
Hi Chris, thank you for the reply.
The team that manages that cluster is not very fond of upgrading SLURM, which I
understand.
A
On 3/21/19 1:56 PM, Ryan Novosielski wrote:
On Mar 21, 2019, at 12:21 PM, Loris Bennett wrote:
Our last cluster only hit around 2.5 million jobs after
around 6 years, so database conversion was never an issue. For sites
with higher throughput, things may be different, but I would hope tha
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 3/21/19 12:21 PM, Loris Bennett wrote:
Hi Ryan,
Ryan Novosielski writes:
On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote:
On 3/20/19 1:58 PM, Christopher Samuel wrote:
On 3/20/19 4:20 AM
On 3/21/19 11:49 AM, Ryan Novosielski wrote:
On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote:
On 3/20/19 1:58 PM, Christopher Samuel wrote:
On 3/20/19 4:20 AM, Frava wrote:
Hi Chris, thank you for the reply.
The team that manages that cluster is not very fond of upgrading SLURM, which I
Hi Loris,
On 3/21/19 6:21 PM, Loris Bennett wrote:
Chris, maybe
you should look at EasyBuild
(https://easybuild.readthedocs.io/en/latest/). That way you can install
all the dependencies (such as zlib) as modules and be pretty much
independent of
> On 21.03.2019 at 16:26, Prentice Bisbal wrote:
>
>
> On 3/20/19 1:58 PM, Christopher Samuel wrote:
>> On 3/20/19 4:20 AM, Frava wrote:
>>
>>> Hi Chris, thank you for the reply.
>>> The team that manages that cluster is not very fond of upgrading SLURM,
>>> which I understand.
>
> As a sy
Hi Peter,
On 3/20/19 11:19 AM, Peter Steinbach wrote:
[root@ernie /]# scontrol show node -dd g1
NodeName=g1 CoresPerSocket=4
CPUAlloc=3 CPUTot=4 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeat
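For what it's worth, the GRES-related fields are the usual place to look in that output; a sketch of filtering for them (the field names, such as GresUsed, and the values shown are illustrative and vary by Slurm version):

[root@ernie /]# scontrol show node -dd g1 | grep -i gres
   Gres=gpu:1050TI:1,gpu_mem_per_card:4G
   GresUsed=gpu:1050TI:1(IDX:0),gpu_mem_per_card:0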
There are two kinds of system admins: can-do and can't-do. You're a can-do;
his are can't-do.
On 3/21/19 10:26 AM, Prentice Bisbal wrote:
>
> On 3/20/19 1:58 PM, Christopher Samuel wrote:
>> On 3/20/19 4:20 AM, Frava wrote:
>>
>>> Hi Chris, thank you for the reply.
>>> The team that manages that
> On Mar 21, 2019, at 12:21 PM, Loris Bennett wrote:
>
> Our last cluster only hit around 2.5 million jobs after
> around 6 years, so database conversion was never an issue. For sites
> with higher throughput, things may be different, but I would hope that
> at those places, the managers wo
On 3/21/19 6:55 AM, David Baker wrote:
it is currently one of the highest priority jobs in the batch partition queue
What does squeue -j 359323 --start say?
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
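For anyone following along, a sketch of what that command reports for a pending job; the job name, user and start time below are made up, and the column layout approximates squeue's default for --start:

$ squeue -j 359323 --start
   JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES NODELIST(REASON)
  359323     batch   bigjob   dbaker PD 2019-03-25T04:10:00     32 (null)     (Priority)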
On 3/21/19 9:21 AM, Loris Bennett wrote:
Chris, maybe you should look at EasyBuild
(https://easybuild.readthedocs.io/en/latest/). That way you can install
all the dependencies (such as zlib) as modules and be pretty much
independent of the ancient packages your distro may provide (other
softwar
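A minimal sketch of what that looks like in practice, assuming EasyBuild is already bootstrapped; the easyconfig and module names are illustrative, so pick the versions your site actually needs:

# Build zlib (and anything it depends on) as an environment module:
$ eb zlib-1.2.11.eb --robot
# Then load it when building Slurm or other software:
$ module load zlib/1.2.11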
Hi Noam,
Right, XDMoD is a standard LAMP-stack web app. You can see some pictures of
the graphs in the web interface in a Google image search here
https://www.google.com/search?q=xdmod&source=lnms&tbm=isch&sa=X
It may also require a fairly beefy database backend, depending on how many
millions of
> On Mar 21, 2019, at 12:38 PM, Alex Chekholko wrote:
>
> Hey Graziano,
>
> To make your decision more "data-driven", you can pipe your SLURM accounting
> logs into a tool like XDMOD which will make you pie charts of usage by user,
> group, job, gres, etc.
>
> https://open.xdmod.org/8.0/index.html
Hey Graziano,
To make your decision more "data-driven", you can pipe your SLURM
accounting logs into a tool like XDMOD which will make you pie charts of
usage by user, group, job, gres, etc.
https://open.xdmod.org/8.0/index.html
You may also consider assigning this task to one of your "machine
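A rough sketch of that pipeline, assuming Open XDMoD's shredder and ingestor tools are installed; the resource name, date range and field list are illustrative, so check the Open XDMoD documentation for the exact fields it expects:

# Dump a day of Slurm accounting data in a parseable form:
$ sacct --allusers --allocations --parsable2 \
        --starttime 2019-03-20T00:00:00 --endtime 2019-03-21T00:00:00 \
        --format JobID,User,Group,Partition,Submit,Start,End,NCPUS,State \
        > slurm-2019-03-20.log
# Shred and ingest it into XDMoD (command names per the Open XDMoD docs):
$ xdmod-shredder -r mycluster -f slurm -i slurm-2019-03-20.log
$ xdmod-ingestor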
Hi Ryan,
Ryan Novosielski writes:
>> On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote:
>> On 3/20/19 1:58 PM, Christopher Samuel wrote:
>>> On 3/20/19 4:20 AM, Frava wrote:
>>>
Hi Chris, thank you for the reply.
The team that manages that cluster is not very fond of upgrading SLUR
> On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote:
> On 3/20/19 1:58 PM, Christopher Samuel wrote:
>> On 3/20/19 4:20 AM, Frava wrote:
>>
>>> Hi Chris, thank you for the reply.
>>> The team that manages that cluster is not very fond of upgrading SLURM,
>>> which I understand.
>
> As a syste
Dear Slurm users,
my team is managing an HPC cluster (running Slurm) for a research
centre. We are planning to expand the cluster in the next couple of
years and we are facing a problem. We would like to put a figure on
how many resources will be needed on average for each user (in terms
of CPU cor
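Not a full answer, but one place such figures usually come from is Slurm's own accounting via sreport; a hedged sketch (cluster name and date range are placeholders) of pulling per-account and per-user CPU-hour consumption to base the averages on:

# CPU-hours consumed per account and user over the past year (placeholders):
$ sreport -t Hours cluster AccountUtilizationByUser \
          cluster=mycluster start=2018-03-01 end=2019-03-01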
Hi Cyrus,
Thank you for the links. I've taken a good look through the first link (re the
cloud cluster) and the only parameter that might be relevant is
"assoc_limit_stop", but I'm not sure if that is relevant in this instance. The
reason for the delay of the job in question is "priority", how
On 3/20/19 1:58 PM, Christopher Samuel wrote:
On 3/20/19 4:20 AM, Frava wrote:
Hi Chris, thank you for the reply.
The team that manages that cluster is not very fond of upgrading
SLURM, which I understand.
As a system admin who manages clusters myself, I don't understand this.
Our job is
Hi David,
You might have a look at the thread "Large job starvation on cloud cluster"
that started on Feb 27; there are some good tidbits in there. Off the top of my
head, without more information, I would venture that the settings you have in
slurm.conf end up backfilling the smaller jobs at the expense of sch
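If it helps to see what "settings in slurm.conf" typically means here, a sketch of the backfill-related knobs that are usually involved; the values are illustrative starting points, not a recommendation for this cluster:

# slurm.conf excerpt -- backfill scheduler tuning (illustrative values)
SchedulerType=sched/backfill
SchedulerParameters=bf_window=4320,bf_resolution=300,bf_max_job_test=1000,bf_continue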
Hello,
I understand that this is not a straightforward question; however, I'm
wondering if anyone has any useful ideas, please. Our cluster is busy and the
QOS has limited users to a maximum of 32 compute nodes on the "batch" queue.
Users are making good use of the cluster -- for example, one user
After more tests, the situation clears up a bit.
If "Cores=0,1" (etc.) is present in the `gres.conf` file, then one can get
GRES jobs onto a single core only by using `--gres-flags=disable-binding`
if a non-GRES job is already running on the same node.
If "Cores=0,1" is NOT present in `gres.conf`, then any