Ciao Gennaro!
> > *NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN*
> > to
> > *NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
> >
> > Now, slurm works and the nodes are running. There is only one minor
> > problem
> >
> > *error: Node node04 has low real_memory size (7984
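(That error usually means node04 reports less RAM than the RealMemory value in
slurm.conf. A quick way to check, sketched here rather than taken from the
thread; the 7900 figure and the split NodeName lines are only illustrative:)

    # on node04: print the hardware slurmd actually detects
    slurmd -C
    # if node04 really has only ~8 GB, give it its own slurm.conf entry, e.g.
    #   NodeName=node04 CPUs=16 RealMemory=7900 State=UNKNOWN
    #   NodeName=node[01-03,05-08] CPUs=16 RealMemory=15999 State=UNKNOWN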
Ciao Elisabetta,
On Tue, Jan 16, 2018 at 04:32:47PM +0100, Elisabetta Falivene wrote:
> being again able to launch slurmctld on the master and slurmd on the nodes.
great!
> *NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN*
> to
> *NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
Here is the solution and another (minor) problem!
Investigating in the direction of the pid problem, I found that the
configuration had
*SlurmctldPidFile=/var/run/slurmctld.pid*
*SlurmdPidFile=/var/run/slurmd.pid*
but the pid files were being looked for in /var/run/slurm-llnl, so I changed
them in slurm.conf.
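(The corrected entries are not quoted in the message; assuming the Debian
slurm-llnl layout, they would presumably look something like:)

    SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
    SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid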
> It seems like the pidfiles in systemd and slurm.conf are different. Check
> whether they are the same and, if not, adjust the slurm.conf pid files. That
> should prevent systemd from killing slurm.
>
Ehm, sorry, how can I do this?
> On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene,
> wrote:
>
>> The deeper I go in the problem, the worse it seems...
>
>> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817:
>> Connection refused
>>
>
> This sounds like the compute node cannot connect back to
> slurmctld on the management node; you should check that the
> IP address
On 16/01/18 04:22, Elisabetta Falivene wrote:
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at
192.168.1.1:6817: Connection refused
This sounds like the compute node cannot connect back to
slurmctld on the management node; you should check that the
IP address
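(A minimal way to check that from the compute node, assuming the config file is
/etc/slurm-llnl/slurm.conf; the address and port come from the log above:)

    # which name/address the node will try to contact
    grep -Ei 'controlmachine|controladdr' /etc/slurm-llnl/slurm.conf
    # can the node actually reach slurmctld on that address and port?
    ping -c1 192.168.1.1
    nc -zv 192.168.1.1 6817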
It seems like the pidfiles in systemd and slurm.conf are different. Check
whether they are the same and, if not, adjust the slurm.conf pid files. That
should prevent systemd from killing slurm.
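(One way to compare the two, assuming the Debian slurm-llnl service names and
config path:)

    # where systemd expects the pid files
    systemctl cat slurmctld.service slurmd.service | grep -i pidfile
    # what slurm.conf says
    grep -i pidfile /etc/slurm-llnl/slurm.conf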
On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene,
wrote:
> The deeper I go in the problem, the worse it seems...
The deeper I go in the problem, the worse it seems... but maybe I'm a step
closer to the solution.
I discovered that munge was disabled on the nodes (my fault: Gennaro
pointed out the problem before, but I had re-enabled it only on the master).
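(For the record, re-enabling it on every node would be something along these
lines, assuming systemd and the standard munge unit name:)

    # on each compute node
    systemctl enable munge
    systemctl start munge
    munge -n | unmunge    # quick sanity check that munged answers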
Btw, it's very strange that the wheezy->jessie upgrade
Googling a bit, the error "slurmd: fatal: Unable to determine this slurmd's
NodeName" comes up when you try to run slurmd on the master, which
shouldn't execute slurmd(?). It must be up on the nodes, not on the master.
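(slurmd looks up its own hostname among the NodeName entries in slurm.conf, so a
quick check on any machine where it fails, assuming the Debian config path, is:)

    hostname -s
    grep -i '^NodeName' /etc/slurm-llnl/slurm.conf
    # the short hostname must match one of the NodeName entries (node01..node08
    # here); the master need not match, since it should not run slurmd at all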
2018-01-15 16:50 GMT+01:00 Douglas Jacobsen :
> Please check your slurm.conf
Hi,
you cannot start slurmd on the headnode. Try running the same command
on the compute nodes and check the output. If there is any issue, it should
display the reason.
Regards,
Carlos
On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <
e.faliv...@ilabroma.com> wrote:
> In the headnode.
Please check your slurm.conf on the compute nodes; I'm thinking that your
compute node isn't appearing in slurm.conf properly.
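(slurm.conf should be identical on the headnode and on every compute node; a
quick way to spot a stale copy, using the node names from the thread:)

    md5sum /etc/slurm-llnl/slurm.conf
    for n in node0{1..8}; do ssh $n md5sum /etc/slurm-llnl/slurm.conf; done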
On Jan 15, 2018 07:45, "John Hearns" wrote:
> That's it. I am calling JohnH's Law:
> "Any problem with a batch queueing system is due to hostname resolution"
>
>
> On 15
In the headnode. (I'm also noticing, and it seems worth mentioning since maybe
the problem is the same, that even ldap is not working as expected, giving the
message "invalid credential (49)", which is a message given when there are
problems of this type. The update to jessie must have touched something that is
affe
Are you trying to start slurmd on the headnode or on a compute node?
Can you provide the slurm.conf file?
Regards,
Carlos
On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <
e.faliv...@ilabroma.com> wrote:
> slurmd -Dvvv says
>
> slurmd: fatal: Unable to determine this slurmd's NodeName
>
>
That's it. I am calling JohnH's Law:
"Any problem with a batch queueing system is due to hostname resolution"
On 15 January 2018 at 16:30, Elisabetta Falivene
wrote:
> slurmd -Dvvv says
>
> slurmd: fatal: Unable to determine this slurmd's NodeName
>
> b
>
> 2018-01-15 15:58 GMT+01:00 Douglas Ja
slurmd -Dvvv says
slurmd: fatal: Unable to determine this slurmd's NodeName
b
2018-01-15 15:58 GMT+01:00 Douglas Jacobsen :
> The fact that sinfo is responding shows that at least slurmctld is
> running. Slurmd, on the other hand, is not. Please also get the output of
> the slurmd log or of running "slurmd -Dvvv"
The fact that sinfo is responding shows that at least slurmctld is
running. Slurmd, on the other hand, is not. Please also get the output of
the slurmd log or of running "slurmd -Dvvv"
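(On a compute node that would be something like the following; the log path is
the Debian default mentioned elsewhere in the thread:)

    # run slurmd in the foreground with verbose debugging
    slurmd -Dvvv
    # or look at what the daemon logged when systemd tried to start it
    tail -n 50 /var/log/slurm-llnl/slurmd.log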
On Jan 15, 2018 06:42, "Elisabetta Falivene"
wrote:
> Anyway I suggest updating the operating system to stretch and fixing your
> configuration under a more recent version of slurm.
I think I'll soon get to that :)
b
2018-01-15 14:08 GMT+01:00 Gennaro Oliva :
> Ciao Elisabetta,
>
> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
From: Elisabetta Falivene
Sent: Monday, January 15, 2018 7:14 AM
To: Slurm User Community List
Subject: [slurm-users] Slurm not starting
I did an upgrade from wheezy to jessie (automatically with a normal
dist-upgrade) on a cluster with 8 nodes (up, running and reachable) an
Ciao Elisabetta,
On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
> Error messages are not much helping me in guessing what is going on. What
> should I check to get what is failing?
check slurmctld.log and slurmd.log; you can find them under
/var/log/slurm-llnl
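(For example, on the master and on a node respectively:)

    tail -n 50 /var/log/slurm-llnl/slurmctld.log
    tail -n 50 /var/log/slurm-llnl/slurmd.log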
> *PARTITION
I did an upgrade from wheezy to jessie (automatically with a normal
dist-upgrade) on a cluster with 8 nodes (up, running and reachable) and
from slurm 2.3.4 to 14.03.9. I overcame some problems booting the kernel (thank
you very much to Gennaro Oliva, btw); now the system is running correctly
with kernel