Hi Will,
On Wed, Nov 08, 2017 at 10:01:31PM -0500, Will L wrote:
> SlurmUser=wlandau
you need to change this to:
SlurmUser=slurm
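If the slurm system user doesn't exist yet, something like this should cover it (a sketch; the nologin path and service name vary by distro):

getent passwd slurm || sudo useradd --system --shell /usr/sbin/nologin slurm
sudo systemctl restart slurmctld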
Best regards
--
Gennaro Oliva
"complete network wide network outage tomorrow night from 10pm across the
whole institute".
^^
Lachlan, I advise running the following script on all login nodes:
#!/bin/bash
#
cat << EOF > /etc/motd
HPC Managers are in the pub.
At this hour of the day you should also be
EOF
In your situation, where you're blocking user access to the login node, it
probably doesn't matter. We use DOWN in most events, as INACTIVE would prevent
new jobs from being queued against the partition at all. DOWN allows the jobs
to be queued, and just doesn't permit them to run. (In either ca
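For reference, toggling a partition state is a one-liner ("batch" is an example partition name):

scontrol update PartitionName=batch State=DOWN      # jobs queue but will not start
scontrol update PartitionName=batch State=INACTIVE  # new submissions are refused
scontrol update PartitionName=batch State=UP        # back to normal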
Hi,
I am debugging slurmd on a worker node with gdb, and I was wondering if there
was a way to disable the socket timeouts between slurmctld and slurmd so that
my jobs don't fail while I'm stepping through code.
Thanks
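PS: for context, the timeouts I mean are slurm.conf settings like these (values illustrative, not recommendations):

MessageTimeout=60    # seconds an RPC may take before timing out
SlurmdTimeout=600    # seconds slurmd may be silent before the node is marked DOWN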
---
Thanks for the suggestions. Munge seems to be working just fine. At one
point I tried to build SLURM from source, but when I could not make it
work, I ran `sudo make uninstall` and opted for the pre-built apt version
all over again. Maybe that made a mess. What should I do to make SLURM
notice
On 09/11/17 11:00, Lachlan Musicman wrote:
> I've just discovered that the partitions have a state, and it can be set
> to UP, DOWN, DRAIN or INACTIVE.
DRAIN the partitions to stop new jobs running, then you can work on how
you suspend running jobs (good luck with that!).
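That first step is a one-liner per partition ("batch" is an example name):

scontrol update PartitionName=batch State=DRAIN   # running jobs continue, nothing new starts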
--
Christopher Samuel
We use something like this:
scontrol create reservation starttime=2017-11-08T06:00:00 duration=1440 \
    user=root flags=maint,ignore_jobs nodes=ALL
Reservation created: root_2
Then confirm:
scontrol show reservation
ReservationName=root_2 StartTime=2017-11-08T06:00:00
EndTime=2017-11-09T06:00:00
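and when the outage is over, drop it again:

scontrol delete ReservationName=root_2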
On 9 November 2017 at 10:54, Elisabetta Falivene wrote:
> I am the admin and I have no documentation :D I'll try the third option.
> Thank you very much
>
Ah. Yes. Well, you will need some sort of drive shared between all the
nodes so that they can read and write from a common space.
Also, I re
The IT team sent an email saying "complete network wide network outage
tomorrow night from 10pm across the whole institute".
Our plan is to put all queued jobs on hold, suspend all running jobs, and
turn off the login node.
I've just discovered that the partitions have a state, and it can be set
to UP, DOWN, DRAIN or INACTIVE.
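A sketch of the first two steps using stock squeue/scontrol (untested):

squeue -h -t PD -o %i | xargs -r -n1 scontrol hold      # hold all pending jobs
squeue -h -t R  -o %i | xargs -r -n1 scontrol suspend   # suspend all running jobs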
I am the admin and I have no documentation :D I'll try the third option.
Thank you very much
On Thursday, 9 November 2017, Lachlan Musicman wrote:
> On 9 November 2017 at 10:35, Elisabetta Falivene wrote:
>
>> Wow, thank you. Is there a way to check which directories the master and
>> the nodes share?
On 9 November 2017 at 10:35, Elisabetta Falivene wrote:
> Wow, thank you. Is there a way to check which directories the master and
> the nodes share?
>
There's no explicit way.
1. Check the cluster documentation written by the cluster admins
2. Ask the cluster admins
3. Run "mount" or "cat /etc/mtab" on each node
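For instance, to pick network filesystems out of the mount table (the type list is just an example):

mount | grep -E 'nfs|lustre|gpfs'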
Wow, thank you. Is there a way to check which directories the master and the
nodes share?
On Wednesday, 8 November 2017, Lachlan Musicman wrote:
> On 9 November 2017 at 09:19, Elisabetta Falivene wrote:
>
>> I'm getting this message anytime I try to execute any job on my cluster.
>> (node01 is the name of my first of eight nodes and is up and running)
On 9 November 2017 at 09:19, Elisabetta Falivene wrote:
> I'm getting this message anytime I try to execute any job on my cluster.
> (node01 is the name of my first of eight nodes and is up and running)
>
> Trying a simple Python script:
> root@mycluster:/tmp# srun python test.py
> slurmd[node01]: error: task/cgroup: unable to build job physical cores
Hi Jeff,
Quite close:
$ sinfo --Format=nodehost,statelong
Cheers,
--
Kilian
I'm getting this message anytime I try to execute any job on my cluster.
(node01 is the name of my first of eight nodes and is up and running)
Trying a simple Python script:
root@mycluster:/tmp# srun python test.py
slurmd[node01]: error: task/cgroup: unable to build job physical cores
/usr/b
I use
alias sn='sinfo -Nle -o "%.20n %.15C %.8O %.7t" | uniq'
and then it's just
[root@machine]# sn
cheers
L.
--
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consc
Thanks, Paul. I've played with SchedulerParameters=defer,... in and out of
the configuration, following suggestions in various SLURM bug tracker
threads, but this was probably when we were still focusing
on trying to get sched/backfill playing ball. I will try again now that
we're
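For reference, the knob in question lives in slurm.conf and looks something like this (values illustrative):

SchedulerType=sched/backfill
SchedulerParameters=defer,sched_interval=30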
Subject says it all. Is there a way to get a list of nodes, their
status, and NOT have duplicate entries in the output? This is what I
have so far but it seems to duplicate nodes if they exist in more than 1
partition, which is true of all my nodes.
sinfo --Node --Format=nodelist,statelong
So hangups like this can occur due to the slurmdbd being busy with
requests. I've seen that happen when an ill-timed massive sacct request
hits while slurmdbd is doing its rollup. In that case the slurmctld
hangs while slurmdbd is busy. Typically in this case restarting
mysql/slurmdbd seems to help.
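That is, something like this (assuming systemd units named mysql and slurmdbd):

sudo systemctl restart mysql slurmdbd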
Hi all,
I see SLURM 17.02.9 slurmctld hang or become unresponsive every few days
with the message in syslog:
server_thread_count over limit (256), waiting
I believe from the user perspective they see "Socket timed out on send/recv
operation". Slurmctld never seems to recover once it's in this st
Hi,
Sorry, to clarify: when the RPM spec file is used, it separates the
crypto_munge.so plugin out into the slurm-munge RPM. I wasn't sure whether
the Debian packaging does the same. To me, the log output
indicates that crypto_munge.so does not exist. If you are using a
./configure
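One way to check is to look for the file directly (both paths are guesses; adjust to your PluginDir):

ls /usr/lib/x86_64-linux-gnu/slurm-wlm/crypto_munge.so   # typical Debian package layout
ls /usr/local/lib/slurm/crypto_munge.so                  # default for a source build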
On 11/8/17 3:01 PM, Douglas Jacobsen wrote:
Also please make sure you have the slurm-munge package installed (at
least for the RPMs this is the name of the package; I'm unsure whether that
packaging layout was preserved for Debian)
nope, it's just "munge"
Regards,
Benjamin
--
FSU Jena | JULIELab.de
Hi Will,
On Wed, Nov 08, 2017 at 01:38:18PM +0000, Will L wrote:
> $ sudo slurmctld -D -f /etc/slurm-llnl/slurm.conf
> slurmctld: slurmctld version 17.02.9 started on cluster cluster
> slurmctld: error: Couldn't find the specified plugin name for crypto/munge
> looking at all files
> slurmctld: er
Hello, I am trying to develop a node features plugin and I have a problem.
When I submit a job like this:
srun -w computer122 -p my_partition -C hot hostname
I want my plugin to see the "hot" constraint when the job starts,
but which function should I use for this?
I think I should use node_features_p_node_state(char **avail_modes, char **current_mode)
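A quick way to confirm the constraint actually reaches the job (the job id is hypothetical):

scontrol show job 12345 | grep -i features   # should show Features=hot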
Also please make sure you have the slurm-munge package installed (at least
for the RPMs this is the name of the package; I'm unsure whether that
packaging layout was preserved for Debian)
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center
It looks like munge is not correctly configured, or you have some kind of
permission problem. This manual explains how to configure and test it.
https://github.com/dun/munge/wiki/Installation-Guide
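From that guide, the usual quick checks are (node01 is an example host):

munge -n | unmunge              # encode and decode locally
munge -n | ssh node01 unmunge   # decode on another node to verify the shared key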
good luck!
2017-11-08 14:38 GMT+01:00 Will L:
> Benjamin,
>
>
> Thanks for following up. I just
Benjamin,
Thanks for following up. I just tried again as you said, with the following
result.
$ sudo slurmctld -D -f /etc/slurm-llnl/slurm.conf
slurmctld: slurmctld version 17.02.9 started on cluster cluster
slurmctld: error: Couldn't find the specified plugin name for crypto/munge
looking at all files
Date: Tue, 7 Nov 2017 11:19:32 +0100
From: Benjamin Redling
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Having errors trying to run a packed jobs
script
Message-ID: <6979a04b-c9c0-badd-b57b-34d4d0ec8...@uni-jena.de>
Content-Type: text/plain; charset=UTF-8
Hi Benjamin,
T