Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Ree, Jan-Albert van
OK so OpenMPI works fine. That means SLURM, OFED and hardware are fine. Which mvapich2 package are you using, a home-built one or one provided by Bright?

Regards,

--
Jan-Albert

Jan-Albert van Ree | Linux System Administrator | Digital Services MARIN | T +31 317 49 35 48 | j.a.v@marin.n
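A quick way to answer that question on an RPM-based node (a generic check, not from the thread; module-based installs won't show up this way):

    rpm -qa | grep -i mvapich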

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread William Brown
The latest MariaDB packaging is different: a third RPM is needed, in addition to the client and developer packages. Away from my desk, but the info is on the MariaDB site.

William

On Wed, 11 Dec 2019, 05:23 Chris Samuel, wrote:
> On Tuesday, 10 December 2019 1:57:59 PM PST Dean Schulze wrote:
>
> > This
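For reference, the official MariaDB yum repository splits the shared client libraries out into their own package, so a build box typically needs something like the following (package names are assumptions based on MariaDB's repository layout; check the MariaDB site, as William suggests):

    yum install MariaDB-client MariaDB-devel MariaDB-shared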

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Chris Samuel
On Tuesday, 10 December 2019 1:57:59 PM PST Dean Schulze wrote:

> This bug report from a couple of years ago indicates a source code issue:
>
> https://bugs.schedmd.com/show_bug.cgi?id=3278
>
> This must have been fixed by now, though.
>
> I built using slurm-19.05.2. Does anyone know if this

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Chris Samuel
Hi Chris,

On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal wrote:

> Test jobs, submitted via sbatch, are able to run on one node with no problem
> but will not run on multiple nodes. The jobs are using mpirun and mvapich2
> is installed.

Is there a reason why you aren'

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Paul Kenyon
Hi Chris,

Your issue sounds similar to a case I ran into once, where I could run jobs on a few nodes, but once it spanned more than a handful it would fail. In that particular case, we figured out that it was due to broadcast storm protection being enabled on the cluster switch. When the first n

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Chris Woelkers - NOAA Federal
Thanks for the reply and the things to try. Here are the answers to your questions/tests in order:

- I tried mpiexec and the same issue occurred.
- While the job is listed as running I checked all the nodes. None of them have processes spawned. I have no idea on the hydra process.
- I have version
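One way to sweep the compute nodes for the hydra launcher that mvapich2's mpirun/mpiexec spawns (a sketch assuming pdsh is available; the node list is a placeholder):

    pdsh -w node[01-16] 'pgrep -af hydra_pmi_proxy'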

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Dean Schulze
There's a problem with the accounting_storage/mysql plugin:

$ sudo slurmdbd -D -
slurmdbd: debug: Log file re-opened
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: debug3: Trying to load plugin /usr/lib/slurm/auth_munge.so
slurmdbd: debug: Munge authentication plugin loaded
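If slurmdbd was built without the MariaDB development files, the mysql plugin is silently skipped at build time (this is what bug 3278, cited earlier in the thread, describes). One way to confirm the plugin was actually built, assuming the same plugin directory as the auth_munge.so line above:

    ls /usr/lib/slurm/accounting_storage_mysql.so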

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Ree, Jan-Albert van
We're running multiple clusters using Bright 8.x with Scientific Linux 7 (and have run Scientific Linux releases 5 and 6 with Bright 5.0 and higher in the past without issues on many different pieces of hardware) and never experienced this. But some things to test:

- some implementations pref
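A minimal multi-node sanity job in the spirit of these tests (a sketch; the node count is a placeholder, and srun is used here instead of mpirun deliberately, to exercise Slurm's own launcher):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    srun hostname

If this prints one hostname per node, Slurm's multi-node launch path works and the problem is more likely in the MPI stack.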

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Dean Schulze
$ systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2019-12-10 13:33:28 MST; 40min ago
  Process: 787 ExecStart=/usr/sbin/slurmdbd $SLUR
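When a unit shows "Active: failed (Result: exit-code)", the daemon's own messages usually say why; standard systemd tooling can pull them (a generic suggestion, not from the thread):

    journalctl -u slurmdbd --since today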

Re: [slurm-users] Need help with controller issues

2019-12-10 Thread Renfro, Michael
What do you get from

systemctl status slurmdbd
systemctl status slurmctld

I’m assuming at least slurmdbd isn’t running.

> On Dec 10, 2019, at 3:05 PM, Dean Schulze wrote:
>
> External Email Warning
> This email originated from outside the university. Please use caution when
> opening attachme

[slurm-users] Need help with controller issues

2019-12-10 Thread Dean Schulze
I'm trying to set up my first slurm installation following these instructions:

https://github.com/nateGeorge/slurm_gpu_ubuntu

I've had to deviate a little bit because I'm using virtual machines that don't have GPUs, so I don't have a gres.conf file, and in /etc/slurm/slurm.conf I don't have an ent
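For reference, a GPU-less setup simply omits gres.conf and any GresTypes/Gres entries; a minimal sketch of the relevant slurm.conf lines (hostnames, counts and partition name are placeholders, not from the thread):

    NodeName=vm[1-2] CPUs=2 RealMemory=2000 State=UNKNOWN
    PartitionName=debug Nodes=vm[1-2] Default=YES MaxTime=INFINITE State=UP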

[slurm-users] Multi-node job failure

2019-12-10 Thread Chris Woelkers - NOAA Federal
I have a 16-node HPC that is in the process of being upgraded from CentOS 6 to 7. All nodes are diskless and connected via 1Gbps Ethernet and FDR InfiniBand. I am using Bright Cluster Manager to manage it, and their support has not found a solution to this problem. For the most part the cluster i

Re: [slurm-users] SLURM_TMPDIR

2019-12-10 Thread Juergen Salk
Hi Angelines,

we create a job-specific scratch directory in the prolog script but use the task_prolog script to set the environment variable.

In prolog:

scratch_dir=/your/path
/bin/mkdir -p ${scratch_dir}
/bin/chmod 700 ${scratch_dir}
/bin/chown ${SLURM_JOB_USER} ${scratch_dir}

In task_prolog:
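The preview cuts off before the task_prolog body. A minimal sketch of what it likely contains, assuming the same scratch path as above (a Slurm task prolog sets environment variables by printing export lines to stdout, which slurmd applies to the task):

    #!/bin/bash
    # Printing "export NAME=value" from a task prolog makes slurmd
    # set that variable in the task's environment.
    echo "export SLURM_TMPDIR=/your/path"

Splitting the work this way is deliberate: the prolog runs as root once per node, so it can mkdir/chown, while the task prolog runs as the job's user per task and only needs to export the variable.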