Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Gennaro Oliva
Hi Will,

On Wed, Nov 08, 2017 at 10:01:31PM -0500, Will L wrote:
> SlurmUser=wlandau

you need to change this to:

SlurmUser=slurm

Best regards
--
Gennaro Oliva
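For reference, the relevant slurm.conf fragment would look like the sketch below (the /etc/slurm-llnl path is the one used elsewhere in this thread for Ubuntu's slurm-llnl packaging; it assumes a system user named "slurm" exists):

```
# /etc/slurm-llnl/slurm.conf (excerpt)
SlurmUser=slurm
```

Restart slurmctld after editing so the change takes effect.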

Re: [slurm-users] Quick hold on all partitions, all jobs

2017-11-08 Thread John Hearns
"complete network wide network outage tomorrow night from 10pm across the whole institute". ^^ Lachlan, I advise running the following script on all login nodes:

#!/bin/bash
cat << EOF > /etc/motd
HPC Managers are in the pub. At this hour of the day you should also be

Re: [slurm-users] Quick hold on all partitions, all jobs

2017-11-08 Thread Jonathon A Anderson
In your situation, where you're blocking user access to the login node, it probably doesn't matter. We use DOWN in most events, as INACTIVE would prevent new jobs from being queued against the partition at all. DOWN allows the jobs to be queued, and just doesn't permit them to run. (In either ca
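As a sketch, the two partition states described above would be set with scontrol roughly like this (the partition name "batch" is hypothetical):

```
# DOWN: jobs may still be queued against the partition, but will not start
scontrol update PartitionName=batch State=DOWN

# INACTIVE: new job submissions to the partition are rejected outright
scontrol update PartitionName=batch State=INACTIVE
```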

[slurm-users] Disable socket timeouts for debugging

2017-11-08 Thread Dave Sizer
Hi, I am debugging slurmd on a worker node with gdb, and I was wondering if there was a way to disable the socket timeouts between slurmctld and slurmd so that my jobs don't fail while I'm stepping through code. Thanks ---
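As far as I know Slurm has no switch to disable these timeouts outright, but they can be raised far enough for an interactive gdb session; a hedged slurm.conf sketch (the values are arbitrary, pick whatever covers your stepping pace):

```
# slurm.conf (excerpt) -- raise RPC and node-liveness timeouts while debugging
MessageTimeout=300   # seconds before an RPC between daemons times out
SlurmdTimeout=600    # seconds before slurmctld declares an unresponsive slurmd DOWN
```

Both daemons need a restart (or scontrol reconfigure) to pick up the change.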

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Will L
Thanks for the suggestions. Munge seems to be working just fine. At one point I tried to build SLURM from source, but when I could not make it work, I `sudo make uninstall`ed it and opted for the pre-built apt version all over again. Maybe that made a mess. What should I do to make SLURM notice

Re: [slurm-users] Quick hold on all partitions, all jobs

2017-11-08 Thread Christopher Samuel
On 09/11/17 11:00, Lachlan Musicman wrote:
> I've just discovered that the partitions have a state, and it can be set
> to UP, DOWN, DRAIN or INACTIVE.

DRAIN the partitions to stop new jobs running, then you can work on how you suspend running jobs (good luck with that!).
--
Christopher Samuel
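A minimal sketch of the drain-and-restore cycle (the partition name is hypothetical):

```
scontrol update PartitionName=batch State=DRAIN   # running jobs finish, nothing new starts
# ... maintenance window ...
scontrol update PartitionName=batch State=UP      # reopen the partition
```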

Re: [slurm-users] Quick hold on all partitions, all jobs

2017-11-08 Thread Stradling, Alden Reid (ars9ac)
We use something like this:

scontrol create reservation starttime=2017-11-08T06:00:00 duration=1440 user=root flags=maint,ignore_jobs nodes=ALL
Reservation created: root_2

Then confirm:

scontrol show reservation
ReservationName=root_2 StartTime=2017-11-08T06:00:00 EndTime=2017-11-09T06:00:0

Re: [slurm-users] Error running jobs with srun

2017-11-08 Thread Lachlan Musicman
On 9 November 2017 at 10:54, Elisabetta Falivene wrote:
> I am the admin and I have no documentation :D I'll try the third option.
> Thank you very much

Ah. Yes. Well, you will need some sort of drive shared between all the nodes so that they can read and write from a common space. Also, I re

[slurm-users] Quick hold on all partitions, all jobs

2017-11-08 Thread Lachlan Musicman
The IT team sent an email saying "complete network wide network outage tomorrow night from 10pm across the whole institute". Our plan is to put all queued jobs on hold, suspend all running jobs, and turn off the login node. I've just discovered that the partitions have a state, and it can be s
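The hold-and-suspend part of the plan above can be sketched with squeue and scontrol (run as root or SlurmUser; this is a sketch, not a battle-tested script):

```
# Hold every job that is still pending
squeue -h -t PD -o '%i' | xargs -r -n 1 scontrol hold

# Suspend every job that is currently running
squeue -h -t R -o '%i' | xargs -r -n 1 scontrol suspend

# After the outage: resume the suspended jobs, then release the held ones
squeue -h -t S -o '%i' | xargs -r -n 1 scontrol resume
```

Note that scontrol suspend stops the job's processes with SIGSTOP; it does not survive a node power-off, so it only helps if the compute nodes stay up through the network outage.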

Re: [slurm-users] Error running jobs with srun

2017-11-08 Thread Elisabetta Falivene
I am the admin and I have no documentation :D I'll try the third option. Thank you very much

On Thursday, 9 November 2017, Lachlan Musicman wrote:
> On 9 November 2017 at 10:35, Elisabetta Falivene wrote:
>> Wow, thank you. Is there a way to check which directories the master and
>> the n

Re: [slurm-users] Error running jobs with srun

2017-11-08 Thread Lachlan Musicman
On 9 November 2017 at 10:35, Elisabetta Falivene wrote:
> Wow, thank you. Is there a way to check which directories the master and
> the nodes share?

There's no explicit way.
1. Check the cluster documentation written by the cluster admins
2. Ask the cluster admins
3. Run "mount" or "cat /etc/m
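For option 3, a quick sketch of what to run on a node (this assumes the shared space is a network filesystem such as NFS, which may not hold on every cluster):

```
# List currently mounted network filesystems
mount -t nfs,nfs4

# Or check what is configured to be mounted at boot
grep -E 'nfs|cifs|lustre|gpfs' /etc/fstab
```

Any path that shows up on both the master and the compute nodes is a candidate for the common read/write space.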

Re: [slurm-users] Error running jobs with srun

2017-11-08 Thread Elisabetta Falivene
Wow, thank you. Is there a way to check which directories the master and the nodes share?

On Wednesday, 8 November 2017, Lachlan Musicman wrote:
> On 9 November 2017 at 09:19, Elisabetta Falivene wrote:
>> I'm getting this message anytime I try to execute any job on my cluster.
>> (node

Re: [slurm-users] Error running jobs with srun

2017-11-08 Thread Lachlan Musicman
On 9 November 2017 at 09:19, Elisabetta Falivene wrote:
> I'm getting this message anytime I try to execute any job on my cluster.
> (node01 is the name of my first of eight nodes and is up and running)
>
> Trying a simple Python script:
> root@mycluster:/tmp# srun python test.py
> slurmd[nod

Re: [slurm-users] Get list of nodes and their status, one node per line, no duplicates

2017-11-08 Thread Kilian Cavalotti
Hi Jeff, Quite close: $ sinfo --Format=nodehost,statelong Cheers, -- Kilian

[slurm-users] Error running jobs with srun

2017-11-08 Thread Elisabetta Falivene
I'm getting this message anytime I try to execute any job on my cluster. (node01 is the name of my first of eight nodes and is up and running)

Trying a simple Python script:

root@mycluster:/tmp# srun python test.py
slurmd[node01]: error: task/cgroup: unable to build job physical cores
/usr/b

Re: [slurm-users] Get list of nodes and their status, one node per line, no duplicates

2017-11-08 Thread Lachlan Musicman
I use

alias sn='sinfo -Nle -o "%.20n %.15C %.8O %.7t" | uniq'

and then it's just

[root@machine]# sn

cheers
L.
--
"The antidote to apocalypticism is apocalyptic civics. Apocalyptic civics is the insistence that we cannot ignore the truth, nor should we panic about it. It is a shared consc

Re: [slurm-users] SLURM 17.02.9 slurmctld unresponsive with server_thread_count over limit, waiting in syslog

2017-11-08 Thread Sean Caron
Thanks, Paul. I've played with SchedulerParameters=defer,... in and out of the configuration per various suggestions in various SLURM bug tracker threads that I looked at, but this was probably when we were still focusing on trying to get sched/backfill playing ball. I will try again now that we're

[slurm-users] Get list of nodes and their status, one node per line, no duplicates

2017-11-08 Thread Jeff White
Subject says it all. Is there a way to get a list of nodes, their status, and NOT have duplicate entries in the output? This is what I have so far but it seems to duplicate nodes if they exist in more than 1 partition, which is true of all my nodes.

sinfo --Node --Format=nodelist,statelong
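Independent of sinfo's own flags, the per-partition duplicates can also be collapsed in the shell; a minimal simulation with made-up node names (the sinfo format string is illustrative):

```shell
# Simulated `sinfo -N -h -o "%N %t"` output: node01 sits in two partitions
printf 'node01 idle\nnode01 idle\nnode02 alloc\n' | sort -u
```

sort -u keeps one copy of each identical line, so node01 appears only once.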

Re: [slurm-users] SLURM 17.02.9 slurmctld unresponsive with server_thread_count over limit, waiting in syslog

2017-11-08 Thread Paul Edmon
So hangups like this can occur due to the slurmdbd being busy with requests. I've seen that happen when an ill-timed massive sacct request hits when slurmdbd is doing its roll up. In that case the slurmctld hangs while slurmdbd is busy. Typically in this case restarting mysql/slurmdbd seems

[slurm-users] SLURM 17.02.9 slurmctld unresponsive with server_thread_count over limit, waiting in syslog

2017-11-08 Thread Sean Caron
Hi all, I see SLURM 17.02.9 slurmctld hang or become unresponsive every few days with the message in syslog: server_thread_count over limit (256), waiting I believe from the user perspective they see "Socket timed out on send/recv operation". Slurmctld never seems to recover once it's in this st

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Douglas Jacobsen
Hi, Sorry, to clarify, when the RPM spec file is used, it separates out the slurm/crypto_munge.so slurm plugin into the slurm-munge RPM. I wasn't sure if a debian package preparation did similar. To me, the log output indicates that slurm/crypto_munge.so does not exist. If you are using a ./con

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Benjamin Redling
On 11/8/17 3:01 PM, Douglas Jacobsen wrote:
> Also please make sure you have the slurm-munge package installed (at least
> for the RPMs this is the name of the package, I'm unsure if that packaging
> layout was conserved for Debian)

nope, it's just "munge"

Regards,
Benjamin
--
FSU Jena | JULIELab.de

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Gennaro Oliva
Hi Will,

On Wed, Nov 08, 2017 at 01:38:18PM +, Will L wrote:
> $ sudo slurmctld -D -f /etc/slurm-llnl/slurm.conf
> slurmctld: slurmctld version 17.02.9 started on cluster cluster
> slurmctld: error: Couldn't find the specified plugin name for crypto/munge looking at all files
> slurmctld: er

[slurm-users] node feature plugin, what use function for get features ?

2017-11-08 Thread Tueur Volvo
Hello, I am trying to develop a node feature plugin and I have a problem. When I run:

srun -w computer122 -p my_partition -C hot hostname

I want my plugin to receive the "hot" feature when the job starts, but which function should I use for this? I think I should use node_features_p_node_state(char **a

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Douglas Jacobsen
Also please make sure you have the slurm-munge package installed (at least for the RPMs this is the name of the package, I'm unsure if that packaging layout was conserved for Debian) Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Manuel Rodríguez Pascual
It looks like munge is not correctly configured, or you have some kind of permission problem. This manual explains how to configure and test it:

https://github.com/dun/munge/wiki/Installation-Guide

Good luck!

2017-11-08 14:38 GMT+01:00 Will L:
> Benjamin,
>
> Thanks for following up. I just
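The basic checks from the linked guide can be sketched as follows (the remote host name is hypothetical; both hosts must share the same /etc/munge/munge.key):

```
# Encode and decode a credential locally
munge -n | unmunge

# Round-trip a credential between two hosts
munge -n | ssh node01 unmunge
```

If the remote unmunge fails, the usual culprits are mismatched munge.key files, wrong permissions on /etc/munge, or skewed clocks.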

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Will L
Benjamin,

Thanks for following up. I just tried again as you said, with the following result.

$ sudo slurmctld -D -f /etc/slurm-llnl/slurm.conf
slurmctld: slurmctld version 17.02.9 started on cluster cluster
slurmctld: error: Couldn't find the specified plugin name for crypto/munge looking at al

Re: [slurm-users] Having errors trying to run a packed jobs script

2017-11-08 Thread Marius Cetateanu
Date: Tue, 7 Nov 2017 11:19:32 +0100
From: Benjamin Redling
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Having errors trying to run a packed jobs script
Message-ID: <6979a04b-c9c0-badd-b57b-34d4d0ec8...@uni-jena.de>
Content-Type: text/plain; charset=UTF-8

Hi Benjamin, T