[slurm-users] Slurm not detecting gpu after swapping out gpu

2020-04-27 Thread Dean Schulze
I replaced an Nvidia V100 with a T4. Now slurm thinks there is no gpu present: $ sudo scontrol show node fabricnode2 NodeName=fabricnode2 Arch=x86_64 CoresPerSocket=6 CPUAlloc=0 CPUTot=12 CPULoad=0.02 AvailableFeatures=(null) ActiveFeatures=(null) Gres=gpu:nvidia:1 NodeAddr=fabricno

Re: [slurm-users] Munge decode failing on new node

2020-04-23 Thread Dean Schulze
I went through the exercise of making the other user the same on the slurmctld as on the slurmd nodes, but that had no effect. I still have 3 nodes that have connectivity and one node where slurmd cannot contact slurmctld. That node has ssh connectivity to and from slurmctld node, but no slurm co

[slurm-users] One node won't connect and false positive messages from slurm every 1 minute 40 seconds

2020-04-22 Thread Dean Schulze
I added two new nodes to my cluster (5 nodes total including controller). One of the new nodes works, but the other one can't connect to the controller. Both new nodes were created the same way except that the one that can't connect to the controller has some extra packages installed to build slur

Re: [slurm-users] Munge decode failing on new node

2020-04-17 Thread Dean Schulze
> > 2. If (1) works, try this from the node running slurmctld to the > problem node > slurm-node$ echo foo | ssh node munge | unmunge > > > > *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On > Behalf Of *Dean Schulze > *Sent:* Friday, April 17,

Re: [slurm-users] Munge decode failing on new node

2020-04-17 Thread Dean Schulze
e > > > > *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On > Behalf Of *Dean Schulze > *Sent:* Friday, April 17, 2020 3:40 PM > *To:* Slurm User Community List > *Subject:* Re: [slurm-users] Munge decode failing on new node > > > > There is no ntp serv

[slurm-users] Alternative to munge for use with slurm?

2020-04-17 Thread Dean Schulze
Is there an alternative to munge when running slurm? Munge issues are a common problem in slurm, and munge doesn't give any useful information when a problem occurs. An alternative that at least gave some useful information when a problem occurs would be a big improvement. Thanks.

Re: [slurm-users] Munge decode failing on new node

2020-04-17 Thread Dean Schulze
encoding time seems odd to me > > On Wed, 15 Apr 2020 at 19:59, Dean Schulze > wrote: > >> I've installed two new nodes onto my slurm cluster. One node works, but >> the other one complains about an invalid credential for munge. I've >> verified that the

Re: [slurm-users] Munge decode failing on new node

2020-04-15 Thread Dean Schulze
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On > Behalf Of *Dean Schulze > *Sent:* Wednesday, April 15, 2020 1:57 PM > *To:* Slurm User Community List > *Subject:* [slurm-users] Munge decode failing on new node > > > > I've installed two n

[slurm-users] Munge decode failing on new node

2020-04-15 Thread Dean Schulze
I've installed two new nodes onto my slurm cluster. One node works, but the other one complains about an invalid credential for munge. I've verified that the munge.key is the same as on all other nodes with sudo cksum /etc/munge/munge.key I recopied a munge.key from a node that works. I've ver
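Beyond comparing munge.key checksums, a round-trip test of the credentials themselves can narrow this down. The following is a minimal sketch, assuming passwordless ssh to the problem node; the function name check_munge and its node argument are made up for illustration. A decode failure with identical keys often points at clock differences between the hosts.

```shell
# Round-trip munge credentials locally and between this host and a node.
# Stops at the first pipeline whose decode step fails.
check_munge() {
    node="$1"                              # hypothetical: name of the problem node
    munge -n | unmunge &&                  # encode and decode on this host
    munge -n | ssh "$node" unmunge &&      # encode here, decode on the node
    ssh "$node" munge -n | unmunge         # encode on the node, decode here
}
```

Running it from the slurmctld host as `check_munge <problem-node>` and again for a node that works shows whether the failure is specific to one direction or one host.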

[slurm-users] Is there a select plugin API that gets called when a job is or has been queued?

2020-02-27 Thread Dean Schulze
This is a code level question. I'm writing a select plugin and I want the plugin to take some action when a job is going to be or has been queued instead of run immediately. Does one of the select plugin APIs get called in either case? I was trying to check for this in select_p_job_test() but it

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-26 Thread Dean Schulze
ns_common.la). Naturally, omitting libcons_common.a from your > plugin doesn't help if you use other functions defined in select/common. > > > > > > > On Feb 26, 2020, at 00:48 , Dean Schulze > wrote: > > > > There was a major refactoring between the 1

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-25 Thread Dean Schulze
There was a major refactoring between the 19.05 and 20.02 code. Most of the callbacks for select plugins were moved to cons_common. I have a plugin for 19.05 that depends on two of those callbacks: select_p_job_begin() and select_p_job_fini(). My plugin is a copy of the select/cons_res plugin, b

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-25 Thread Dean Schulze
Hi Tim, I'm very interested in the "configless" setup for slurm. Is the setup for configless documented somewhere? Dean Schulze 303.909.3245 mobile On Tue, Feb 25, 2020 at 11:57 AM Tim Wickberg wrote: > After 9 months of development and testing we are pleased to announce

Re: [slurm-users] Why does the make install path get hard coded into the slurmd binary?

2020-02-18 Thread Dean Schulze
rmd binary, however. On Tue, Feb 18, 2020 at 3:44 PM Dean Schulze wrote: > I built slurm on one machine (controller) and copied the new slurmd binary > to a node. When I started it with systemctl it failed with the message: > > fatal: Unable to find slurmstepd file at > /home/dean

Re: [slurm-users] Why does the make install path get hard coded into the slurmd binary?

2020-02-18 Thread Dean Schulze
m --enable-pam > --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm > > Regards, > Alex > > On Tue, Feb 18, 2020 at 2:45 PM Dean Schulze > wrote: > >> I built slurm on one machine (controller) and copied the new slurmd >> binary to a node.

[slurm-users] Why does the make install path get hard coded into the slurmd binary?

2020-02-18 Thread Dean Schulze
I built slurm on one machine (controller) and copied the new slurmd binary to a node. When I started it with systemctl it failed with the message: fatal: Unable to find slurmstepd file at /home/dean/src/slurm.versions/slurm-19.05.4.build/ The path it refers to is what I gave to ./configure --prefix==

Re: [slurm-users] How should I configure a node with Autodetect=nvml?

2020-02-11 Thread Dean Schulze
gres.conf from the node. On Mon, Feb 10, 2020 at 11:41 PM Chris Samuel wrote: > On Monday, 10 February 2020 12:11:30 PM PST Dean Schulze wrote: > > > With this configuration I get this message every second in my > slurmctld.log > > file: > > > > error: _slurm_rpc_

[slurm-users] How should I configure a node with Autodetect=nvml?

2020-02-10 Thread Dean Schulze
In the gres.conf on one of my nodes I have just the line Autodetect=nvml as in the last example in https://slurm.schedmd.com/gres.conf.html. In the slurm.conf on all nodes I have this line for the node with Autodetect=nvml NodeName=slurmnode1 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerS

Re: [slurm-users] How to use Autodetect=nvml in gres.conf

2020-02-07 Thread Dean Schulze
slurm-users On Behalf Of > Stephan Roth > > Sent: Friday, February 7, 2020 2:23 AM > > To: slurm-users@lists.schedmd.com > > Subject: Re: [slurm-users] How to use Autodetect=nvml in gres.conf > > > > On 05.02.20 21:06, Dean Schulze wrote: > > > I need to d

[slurm-users] Which ports does slurm use?

2020-02-06 Thread Dean Schulze
I've moved two nodes to a different controller. The nodes are wired and the controller is networked via wifi. I had to open up ports 6817 and 6818 between the wired and wireless sides of our network to get any connectivity. Now when I do srun -N2 hostname the jobs show connection timeouts on t
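Ports 6817 and 6818 are the defaults for slurmctld and slurmd, but srun also listens on ports of its own that the compute nodes connect back to. Unless SrunPortRange is set in slurm.conf, those ports are chosen dynamically, so a firewall between the two sides of a network can block them. Whether that is the cause here is an assumption; the sketch below only lists the port settings the running cluster reports. The function name show_slurm_ports is made up for illustration.

```shell
# Print the port-related settings from the running configuration.
show_slurm_ports() {
    scontrol show config | grep -i -E 'SlurmctldPort|SlurmdPort|SrunPortRange'
}
```

If SrunPortRange turns out to be unset, setting it to a fixed range and opening that range from the nodes to the submitting host is one option to try.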

[slurm-users] Nodes stuck in drain state and sending Invalid Argument every second

2020-02-06 Thread Dean Schulze
I moved two nodes to another controller and the two nodes will not come out of the drain state now. I've rebooted the hosts but they are still stuck in the drain state. There is nothing in the location given for saving state so I can't understand why a reboot doesn't clear this. Here's the node

[slurm-users] How to use Autodetect=nvml in gres.conf

2020-02-05 Thread Dean Schulze
I need to dynamically configure gpus on my nodes. The gres.conf doc says to use Autodetect=nvml in gres.conf instead of adding configuration details to each gpu in gres.conf. The docs aren't really clear about this because they show an example with the details for each gpu: AutoDetect=nvml Nam

[slurm-users] sbatch script won't accept --gres that requires more than 1 gpu

2020-02-03 Thread Dean Schulze
When I run an sbatch script with the line #SBATCH --gres=gpu:gp100:1 it runs. When I change it to #SBATCH --gres=gpu:gp100:3 it fails with "Requested node configuration is not available". But I have a node with 4 gp100s available. Here's my slurm.conf: NodeName=liqidos-dean-node1 CPUs=2 Boa

[slurm-users] How do I add a library for the linker in Makefile.in

2020-01-30 Thread Dean Schulze
I'm writing a plugin (based on the select/cons_res plugin). I need to add this library for the linker when my plugin is built: /usr/lib/x86_64-linux-gnu/libcurl.a Apparently I need to add this library to the Makefile.in. Where do I add this? Do I need to add this in the Makefile.am too? Thank

Re: [slurm-users] Question about slurm source code and libraries

2020-01-24 Thread Dean Schulze
, just > used the regular web frontend for viewing queue and node state. > > [1] https://edf-hpc.github.io/slurm-web/index.html > [2] https://edf-hpc.github.io/slurm-web/api.html > > > On Jan 24, 2020, at 1:22 PM, Dean Schulze > wrote: > > > > External Email W

[slurm-users] Question about slurm source code and libraries

2020-01-24 Thread Dean Schulze
Since there isn't a list for slurm development I'll ask here. Does the slurm code include a library for making REST calls? I'm writing a plugin that will make REST calls and if slurm already has one I'll use that, otherwise I'll find one with an appropriate open source license for my plugin. Tha

Re: [slurm-users] Can't get node out of drain state

2020-01-23 Thread Dean Schulze
: > > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons > > Also check that slurmd daemons on the compute nodes can talk to each other > (not just to the master). e.g. bottom of > https://slurm.schedmd.com/big_sys.html > > Regards, > Alex >

[slurm-users] Can't get node out of drain state

2020-01-23 Thread Dean Schulze
I've tried the normal things with scontrol ( https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/), but I have a node that will not come out of the drain state. I've also done a hard reboot and tried again. Are there any other remedies? Thanks.
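For readers who have not seen them, the usual scontrol steps look roughly like the sketch below. It is an assumption that this matches what the linked blog post describes; the function name resume_node is made up for illustration, and the update needs to be run as root or the SlurmUser.

```shell
# Show why a node is drained, then ask slurmctld to return it to service.
resume_node() {
    node="$1"                                        # hypothetical: the drained node's name
    sinfo -R --nodes="$node"                         # list the recorded drain reason
    scontrol update NodeName="$node" State=RESUME    # clear the drain state
}
```

If the node drains again right after the update, the reason shown by sinfo -R and the slurmctld log are the places to look, since slurmctld will re-drain a node whose registered resources do not match slurm.conf.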

[slurm-users] Spoofing a GPU on a slurm node virtual machine

2020-01-22 Thread Dean Schulze
I'm trying to spoof a gpu on a CentOS 7.7 virtual machine that is a slurm node. I just want slurm to see that this node has a gpu. I'm not going to execute any code that uses a gpu. I created a character device with: mknod nvidia0 c 1 1 Here's what it looks like: [root@liqidos-dean-node1 dev]#

Re: [slurm-users] sbatch sending the working directory from the controller to the node

2020-01-21 Thread Dean Schulze
ses *srun* will propagate the current > working directory, unless *--chdir*=<*path*> is specified, in which case > *path* will become the working directory for the remote processes. > > > > William > > > > *From:* slurm-users *On Behalf Of > *Dean Schulze > *Sent:

[slurm-users] sbatch sending the working directory from the controller to the node

2020-01-21 Thread Dean Schulze
I run this sbatch script from the controller: === #!/bin/bash #SBATCH --job-name=test_job #SBATCH --mail-type=NONE # Mail events (NONE, BEGIN, END, FAIL, ALL) #SBATCH --ntasks=1 #SBATCH --mem=1gb #SBATCH --time=00:05:00 # Time limit hrs:min:sec #SBATCH --output=test_job_

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-21 Thread Dean Schulze
AM Brian Johanson wrote: > > On 1/21/2020 12:32 AM, Chris Samuel wrote: > > On 20/1/20 3:00 pm, Dean Schulze wrote: > > > >> There's either a problem with the source code I cloned from github, > >> or there is a problem when the controller runs on Ubuntu 19

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
md is still running. Sounds possible that OOM Killer or such >> may be killing slurmd >> >> Brian Andrus >> On 1/20/2020 1:12 PM, Dean Schulze wrote: >> >> If I restart slurmd the asterisk goes away. Then I can run the job once >> and the asterisk is back, and

[slurm-users] Downgraded to slurm 19.05.4 and now slurmctld won't start because of incompatible state

2020-01-20 Thread Dean Schulze
This is what I get from systemctl status slurmctld: fatal: Can not recover last_tres state, incompatible version, got 8960 need >= 8192 <= 8704, start with '-i' to ignore this Starting it with the -i option doesn't do anything. Where does slurm store this state so I can get rid of it? Thanks.
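slurmctld keeps its saved state in the directory named by StateSaveLocation in slurm.conf. Since the daemon is not running, the setting has to be read from the file itself. A minimal sketch, assuming the config lives at /etc/slurm/slurm.conf (adjust for other installs); the function name state_save_location is made up for illustration.

```shell
# Print the StateSaveLocation value from a slurm.conf file.
state_save_location() {
    conf="${1:-/etc/slurm/slurm.conf}"     # assumed default path
    grep -i '^StateSaveLocation' "$conf" | cut -d= -f2
}
```

Clearing that directory discards queued and running job records along with the incompatible state files, so it is worth moving the contents aside rather than deleting them.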

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
eck the connectivity from the slurmctld host to the compute node (telnet > may be enough). You can also check the slurmctld logs for more information. > > Regards, > Carlos > > On Mon, 20 Jan 2020 at 21:04, Dean Schulze > wrote: > >> I've got a node running on Ce

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
lurmd on the node and > check the connectivity from the slurmctld host to the compute node (telnet > may be enough). You can also check the slurmctld logs for more information. > > Regards, > Carlos > > On Mon, 20 Jan 2020 at 21:04, Dean Schulze > wrote: > >> I'

[slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
I've got a node running on CentOS 7.7 built from the recent 20.02.0pre1 code base. Its behavior is strange to say the least. The controller was built from the same code base, but on Ubuntu 19.10. The controller reports the node's state with sinfo, but can't run a simple job with srun because it

[slurm-users] Adding to / upgrading slurm with a new plugin

2020-01-13 Thread Dean Schulze
I'm writing a select plugin for slurm, and when I do an srun command with the new plugin as the SelectType in slurm.conf I get this error: srun: error: Task launch for 32.0 failed on node slurmnode2: Header lengths are longer than data received The new plugin is just a copy of the cons_res plugin

[slurm-users] autoreconf fails because of undefined macros (creating a new plugin)

2020-01-09 Thread Dean Schulze
Following the docs for adding a new plugin, I ran the autoreconf command from the top directory. It fails with this error: $ autoreconf configure.ac:240: warning: macro 'AM_PATH_GLIB_2_0' not found in library configure.ac:246: warning: macro 'AM_PATH_GTK_2_0' not found in library configure.ac:240

[slurm-users] Can I add a new slurm plugin to an existing installation, or do I have to rebuild and reinstall with the plugin source?

2020-01-07 Thread Dean Schulze
The SchedMD docs for adding a plugin describe adding source code, Makefiles, and other modifications for a new plugin to a git branch in the source tree. This makes it sound like I would have to rebuild and reinstall slurm in order to use a new plugin that I cre

[slurm-users] Need to execute a binary with arguments on a node

2019-12-18 Thread Dean Schulze
This is a rookie question. I can use the srun command to execute a simple command like "ls" or "hostname" on a node. But I haven't found a way to add arguments like "ls -lart". What I need to do is execute a binary that takes arguments (like "a.out arg1 arg2 arg3") that exists on the node. Is s
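srun treats everything after its own options as the command to run plus that command's arguments, so no special quoting is needed. A minimal sketch; the wrapper name run_on_node and the binary path in the comments are made up for illustration.

```shell
# Run a command with its arguments on one node.
# "$@" forwards the command and every argument to srun unchanged.
run_on_node() {
    srun -N1 "$@"
}

# Examples (the binary path must exist on the node):
#   run_on_node ls -lart
#   run_on_node /path/to/a.out arg1 arg2 arg3
```

The same works without the wrapper, e.g. `srun -N1 ls -lart`; the only thing to watch is that srun's own options come before the command name.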

[slurm-users] slurmd.service fails to register

2019-12-16 Thread Dean Schulze
I have my controller running (slurmctld and slurmdbd) and my controller and node host can ping each other by name so they resolve via /etc/hosts settings. When I try to start the slurmd.service it shows that it is active (running), but gives these errors: Unable to register: Zero Bytes were trans

[slurm-users] Where is the slurmstepd location configured?

2019-12-16 Thread Dean Schulze
When I try to start a node it fails with this message: fatal: Unable to find slurmstepd file at /storage/slurm-build/sbin/slurmstepd The location /storage/slurm-build/sbin/slurmstepd is where the binaries were built by make (I used ./configure --prefix=/storage/slurm-build). After I created the

[slurm-users] slurmdbd.service gives: Unable to initialize auth/munge authentication plugin

2019-12-15 Thread Dean Schulze
My slurm controller was running this week on a virtual machine on my laptop. When I try to use it while logged in on the VPN I get this error from the slurmdbd.service: Unable to initialize auth/munge authentication plugin Could this be due to having a different IP address since I'm on the VPN?

Re: [slurm-users] Need help with controller issues

2019-12-12 Thread Dean Schulze
ed include statements. I'll open another thread about those. On Tue, Dec 10, 2019 at 2:05 PM Dean Schulze wrote: > I'm trying to set up my first slurm installation following these > instructions: > > https://github.com/nateGeorge/slurm_gpu_ubuntu > > I've ha

Re: [slurm-users] Need help with controller issues

2019-12-12 Thread Dean Schulze
e RELEASE "1" | #define SLURM_VERSION_STRING "19.05.4" | /* end confdefs.h. */ | #include | int | main () | { | | MYSQL mysql; | (void) mysql_init(&mysql); | (void) mysql_close(&mysql); | | ; | return 0; | } configure:5041: WARNING: *** MySQL test program execution failed

Re: [slurm-users] Need help with controller issues

2019-12-11 Thread Dean Schulze
amd64 MySQL database server binaries and system database setup ii mysql-server-core-5.7 5.7.28-0ubuntu0.18.04.4 amd64 MySQL database server binaries On Tue, Dec 10, 2019 at 2:05 PM Dean Schulze wrote: > I'm trying to set

Re: [slurm-users] Need help with controller issues

2019-12-11 Thread Dean Schulze
files On Wed, Dec 11, 2019 at 9:04 AM Dean Schulze wrote: > These are the packages I installed prior to building slurm: > > libmariadb-client-lgpl-dev > libmysqlclient-dev > mariadb-server > > This installs mariadb 10.1.43 which is old. > > On the Ubuntu site (https://pac

Re: [slurm-users] Need help with controller issues

2019-12-11 Thread Dean Schulze
s Samuel wrote: > On Tuesday, 10 December 2019 1:57:59 PM PST Dean Schulze wrote: > > > This bug report from a couple of years ago indicates a source code issue: > > > > https://bugs.schedmd.com/show_bug.cgi?id=3278 > > > > This must have been fixed by now, though

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Dean Schulze
gi?id=3278 This must have been fixed by now, though. I built using slurm-19.05.2. Does anyone know if this has been fixed in 19.05.4? On Tue, Dec 10, 2019 at 2:05 PM Dean Schulze wrote: > I'm trying to set up my first slurm installation following these > instructions: > > ht

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Dean Schulze
Maybe there is some config that has to be modified to use maria instead of mysql? On Tue, Dec 10, 2019 at 2:13 PM Renfro, Michael wrote: > What do you get from > > systemctl status slurmdbd > systemctl status slurmctld > > I’m assuming at least slurmdbd isn’t running.

[slurm-users] Need help with controller issues

2019-12-10 Thread Dean Schulze
I'm trying to set up my first slurm installation following these instructions: https://github.com/nateGeorge/slurm_gpu_ubuntu I've had to deviate a little bit because I'm using virtual machines that don't have GPUs, so I don't have a gres.conf file and in /etc/slurm/slurm.conf I don't have an ent

[slurm-users] (no subject)

2019-12-06 Thread Dean Schulze
I'm doing my first slurm installation. The schedmd docs assume that I have a cluster that meets certain (unstated) requirements available, but I don't. I've found a couple of examples showing how to set up a cluster for slurm using real hardware (nodes) with GPUs: https://github.com/mknoxnv/ubu