Re: [slurm-users] Slurm not starting

2018-01-17 Thread Elisabetta Falivene
Ciao Gennaro! > > *NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN* > > to > > *NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN* > > > > Now, slurm works and the nodes are running. There is only one minor > problem > > > > *error: Node node04 has low real_memory size (7984
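For anyone hitting the same "low real_memory size" error: a minimal sketch, assuming RealMemory is compared in megabytes against roughly what the kernel reports as MemTotal. The sample /proc/meminfo text is made up so that it yields the 7984 seen in the error above, and the function names are only illustrative; this is not how slurmd itself computes the value.

```python
import re


def mem_total_mib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, in whole MiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    # /proc/meminfo reports kB (1024-byte units); round down to MiB.
    return int(match.group(1)) // 1024


def real_memory_too_high(configured_mb, meminfo_text):
    """True if the configured RealMemory exceeds what the node reports."""
    return mem_total_mib(meminfo_text) < configured_mb


# Hypothetical sample, standing in for the contents of /proc/meminfo on node04.
sample_meminfo = "MemTotal:        8175616 kB\nMemFree:          123456 kB\n"
```

On a real node you would pass open("/proc/meminfo").read() instead of the sample string. Running "slurmd -C" on the node also prints the hardware configuration that slurmd detects, which may be the easier check.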

Re: [slurm-users] Slurm not starting

2018-01-16 Thread Gennaro Oliva
Ciao Elisabetta, On Tue, Jan 16, 2018 at 04:32:47PM +0100, Elisabetta Falivene wrote: > being again able to launch slurmctld on the master and slurmd on the nodes. great! > *NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN* > to > *NodeName=node[01-08] CPUs=16 RealMemory=15999 State=U

Re: [slurm-users] Slurm not starting

2018-01-16 Thread Elisabetta Falivene
Here is the solution and another (minor) problem! Investigating the pid problem, I found that the settings contained *SlurmctldPidFile=/var/run/slurmctld.pid* *SlurmdPidFile=/var/run/slurmd.pid* but the pid was being looked for in /var/run/slurm-llnl, so I changed in the slurm.conf
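The pid file mismatch described above can be checked mechanically. This is a rough sketch, assuming the systemd unit declares its path with a PIDFile= line; the sample unit text and the file name under /var/run/slurm-llnl are assumptions for illustration, and the helper names are hypothetical.

```python
import re


def slurm_conf_pidfiles(conf_text):
    """Map 'slurmctld'/'slurmd' to the PID file path set in slurm.conf text."""
    found = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"^(Slurmctld|Slurmd)PidFile\s*=\s*(\S+)$",
                         line, re.IGNORECASE)
        if match:
            found[match.group(1).lower()] = match.group(2)
    return found


def unit_pidfile(unit_text):
    """Return the PIDFile= value from systemd unit text, or None if absent."""
    for line in unit_text.splitlines():
        match = re.match(r"^\s*PIDFile\s*=\s*(\S+)\s*$", line)
        if match:
            return match.group(1)
    return None


def pidfile_matches(daemon, conf_text, unit_text):
    """True if slurm.conf and the unit file agree on the daemon's PID file."""
    return slurm_conf_pidfiles(conf_text).get(daemon) == unit_pidfile(unit_text)


# Settings quoted in the message above.
sample_conf = ("SlurmctldPidFile=/var/run/slurmctld.pid\n"
               "SlurmdPidFile=/var/run/slurmd.pid\n")
# Assumed unit contents; check the real unit file on your system.
sample_unit = "[Service]\nType=forking\nPIDFile=/var/run/slurm-llnl/slurmctld.pid\n"
```

With these samples pidfile_matches("slurmctld", sample_conf, sample_unit) is False, which is the situation where systemd never sees the pid it is waiting for.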

Re: [slurm-users] Slurm not starting

2018-01-16 Thread Elisabetta Falivene
> It seems like the pidfile in systemd and slurm.conf are different. Check > if they are the same and if not adjust the slurm.conf pid files. That > should prevent systemd from killing slurm. > Emh, sorry, how can I do this? > On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene, > wrote: > >> The de

Re: [slurm-users] Slurm not starting

2018-01-16 Thread Elisabetta Falivene
> > slurmd: debug2: _slurm_connect failed: Connection refused >> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: >> Connection refused >> > > This sounds like the compute node cannot connect back to > slurmctld on the management node, you should check that the > IP address

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Christopher Samuel
On 16/01/18 04:22, Elisabetta Falivene wrote: slurmd: debug2: _slurm_connect failed: Connection refused slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused This sounds like the compute node cannot connect back to slurmctld on the management node, you s
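A quick way to test the "Connection refused" case from a compute node is to try a plain TCP connection to the slurmctld port. This is only a sketch using the Python standard library; the function name is illustrative, and a successful connection says nothing about munge or the Slurm protocol itself.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

From a node, can_connect("192.168.1.1", 6817) uses the address and port from the log lines above. False would point to slurmctld not listening there or to something blocking the port.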

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Carlos Fenoy
It seems like the pidfile paths in systemd and slurm.conf are different. Check whether they are the same and, if not, adjust the pid files in slurm.conf. That should prevent systemd from killing slurm. On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene, wrote: > The deeper I go in the problem, the worse it seems..

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Elisabetta Falivene
The deeper I go into the problem, the worse it seems... but maybe I'm a step closer to the solution. I discovered that munge was disabled on the nodes (my fault, Gennaro pointed out the problem before, but I re-enabled it only on the master). Btw, it's very strange that the wheezy->jessie upgrade

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Elisabetta Falivene
Googling a bit, the error "slurmd: fatal: Unable to determine this slurmd's NodeName" comes up when you try to check slurmd on the master, which shouldn't execute slurmd(?). It must be up on the nodes, not on the master. 2018-01-15 16:50 GMT+01:00 Douglas Jacobsen : > Please check your slurm.conf
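To illustrate why the master triggers this error: as far as I understand, slurmd looks for the machine's short hostname among the NodeName entries in slurm.conf. A minimal sketch of that lookup follows, assuming only the simple prefix[lo-hi] form used in this thread (real Slurm hostlists allow more forms). The hostname "master" and the function names are hypothetical.

```python
import re


def expand_hostlist(expr):
    """Expand 'node[01-08]' into node01..node08; other names pass through."""
    match = re.fullmatch(r"([^\[\]]+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        return [expr]
    prefix, low, high = match.groups()
    width = len(low)  # keep zero padding, e.g. 01 rather than 1
    return [f"{prefix}{i:0{width}d}" for i in range(int(low), int(high) + 1)]


def hostname_is_configured(hostname, node_exprs):
    """True if the short hostname appears in any NodeName expression."""
    short = hostname.split(".", 1)[0]
    return any(short in expand_hostlist(expr) for expr in node_exprs)
```

With NodeName=node[01-08], hostname_is_configured("node04", ["node[01-08]"]) is True, while a head node called "master" is not in the list, so slurmd there has no NodeName to pick.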

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Carlos Fenoy
Hi, you cannot start slurmd on the headnode. Try running the same command on the compute nodes and check the output. If there is any issue it should display the reason. Regards, Carlos On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene < e.faliv...@ilabroma.com> wrote: > In the headnode.

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Douglas Jacobsen
Please check your slurm.conf on the compute nodes, I'm thinking that your compute node isn't appearing in slurm.conf properly. On Jan 15, 2018 07:45, "John Hearns" wrote: > That's it. I am calling JohnH's Law: > "Any problem with a batch queueing system is due to hostname resolution" > > > On 15

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Elisabetta Falivene
In the headnode. (I'm also noticing, and it seems worth mentioning since the problem may be the same, that even ldap is not working as expected, giving the message "invalid credential (49)", which is a message given when there are problems of this type. The update to jessie must have touched something that is affe

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Carlos Fenoy
Are you trying to start the slurmd in the headnode or a compute node? Can you provide the slurm.conf file? Regards, Carlos On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene < e.faliv...@ilabroma.com> wrote: > slurmd -Dvvv says > > slurmd: fatal: Unable to determine this slurmd's NodeName > >

Re: [slurm-users] Slurm not starting

2018-01-15 Thread John Hearns
That's it. I am calling JohnH's Law: "Any problem with a batch queueing system is due to hostname resolution" On 15 January 2018 at 16:30, Elisabetta Falivene wrote: > slurmd -Dvvv says > > slurmd: fatal: Unable to determine this slurmd's NodeName > > b > > 2018-01-15 15:58 GMT+01:00 Douglas Ja

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Elisabetta Falivene
slurmd -Dvvv says slurmd: fatal: Unable to determine this slurmd's NodeName b 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen : > The fact that sinfo is responding shows that at least slurmctld is > running. slurmd, on the other hand, is not. Please also get output of > slurmd log or running "slurm

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Douglas Jacobsen
The fact that sinfo is responding shows that at least slurmctld is running. slurmd, on the other hand, is not. Please also get the output of the slurmd log or of running "slurmd -Dvvv" On Jan 15, 2018 06:42, "Elisabetta Falivene" wrote: > > Anyway I suggest to update the operating system to stretch and fix

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Elisabetta Falivene
> Anyway I suggest to update the operating system to stretch and fix your > configuration under a more recent version of slurm. I think I'll soon arrive to that :) b 2018-01-15 14:08 GMT+01:00 Gennaro Oliva : > Ciao Elisabetta, > > On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wr

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Williams, Jenny Avis
From: Elisabetta Falivene Sent: Monday, January 15, 2018 7:14 AM To: Slurm User Community List Subject: [slurm-users] Slurm not starting I did an upgrade from wheezy to jessie (automatically with a normal dist-upgrade) on a cluster with 8 nodes (up, running and reachable) an

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Gennaro Oliva
Ciao Elisabetta, On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote: > Error messages are not much helping me in guessing what is going on. What > should I check to get what is failing? check slurmctld.log and slurmd.log, you can find them under /var/log/slurm-llnl > *PARTITION

[slurm-users] Slurm not starting

2018-01-15 Thread Elisabetta Falivene
I did an upgrade from wheezy to jessie (automatically with a normal dist-upgrade) on a cluster with 8 nodes (up, running and reachable) and from slurm 2.3.4 to 14.03.9. I overcame some problems booting the kernel (thank you very much to Gennaro Oliva, btw), and now the system is running correctly with kernel