I’ve asked this question on Server Fault, and I’ll re-ask the whole thing here 
for posterity’s sake:

https://serverfault.com/questions/995981/pacemaker-wont-start-because-duplicate-node-but-cant-remove-dupe-node-because

OK! Really new to pacemaker/corosync, like 1 day new.

Software: Ubuntu 18.04 LTS and the versions associated with that distro.

pacemakerd: 1.1.18

corosync: 2.4.3

I accidentally removed the nodes from my entire test cluster (3 nodes).

When I tried to bring everything back up using the `pcsd` GUI, that failed 
because the nodes had been "wiped out". Cool.

So: I had a copy of the last `corosync.conf` from my "primary" node. I copied 
it to the other two nodes and fixed the `bindnetaddr` in the respective confs. 
Then I ran `pcs cluster start` on my "primary" node.

One of the nodes failed to come up. I took a look at the status of `pacemaker` 
on that node and got the following error:

    Dec 18 06:33:56 region-ctrl-2 crmd[1049]:     crit: Nodes 1084777441 and 2 
share the same name 'region-ctrl-2': shutting down

I tried running `crm_node -R --force 1084777441` on the machine where 
`pacemaker` won't start, but of course `pacemaker` isn't running, so I got a 
`crmd: connection refused (111)` error. So I ran the same command on one of 
the healthy nodes, which showed no errors, but the node never went away, and 
`pacemaker` on the affected machine continued to show the same error.

So, I decided to tear down the entire cluster and start again. I purged all 
the packages from the machine and reinstalled everything fresh. I copied the 
`corosync.conf` over to the machine and fixed it. I recreated the cluster. I 
get the exact same bloody error.

So this node with id `1084777441` is not a machine I created; it's one the 
cluster created for me. Earlier in the day I had realized that I was using IP 
addresses in `corosync.conf` instead of names. I fixed the `/etc/hosts` on the 
machines and removed the IP addresses from the corosync config, and that's how 
I inadvertently deleted my whole cluster in the first place (I removed the 
nodes that were named by IP address).
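One observation that may or may not matter (this is my guess, not something 
I've confirmed): corosync.conf(5) says that when a node has no `nodeid`, 
corosync derives one from its ring0 IPv4 address, and `clear_node_high_bit: 
yes` clears the top bit of that auto-generated value. Running that arithmetic 
on my node's address `192.168.99.225` yields exactly the mystery id:

```shell
# Sketch: treat the IPv4 address as a 32-bit big-endian integer and
# clear the high bit, as clear_node_high_bit: yes does for
# auto-generated nodeids (my assumption about what happened here).
ip=192.168.99.225
oldIFS=$IFS; IFS=.
set -- $ip
IFS=$oldIFS
nodeid=$(( ($1 << 24 | $2 << 16 | $3 << 8 | $4) & 0x7FFFFFFF ))
echo "$nodeid"    # 1084777441, the id from the crmd error
```

If that's right, corosync on that node isn't matching itself to any nodelist 
entry and is inventing a nodeid from its IP, which would explain why purging 
and reinstalling didn't help.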

The following is my corosync.conf:

    totem {
        version: 2
        cluster_name: maas-cluster
        token: 3000
        token_retransmits_before_loss_const: 10
        clear_node_high_bit: yes
        crypto_cipher: none
        crypto_hash: none

        interface {
            ringnumber: 0
            bindnetaddr: 192.168.99.225
            mcastport: 5405
            ttl: 1
        }
    }

    logging {
        fileline: off
        to_stderr: no
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on

        logger_subsys {
            subsys: QUORUM
            debug: off
        }
    }

    quorum {
        provider: corosync_votequorum
        expected_votes: 3
        two_node: 1
    }

    nodelist {
        node {
            ring0_addr: postgres-sb
            nodeid: 3
        }

        node {
            ring0_addr: region-ctrl-2
            nodeid: 2
        }

        node {
            ring0_addr: region-ctrl-1
            nodeid: 1
        }
    }

The only thing different about this conf between the nodes is the `bindnetaddr`.
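For what it's worth, my reading of corosync.conf(5) is that `bindnetaddr` is 
normally the *network* address rather than a host address (so for 
`192.168.99.225/24` that would be `192.168.99.0`), in which case the same 
value, and hence an identical conf, could be used on every node. A sketch of 
the arithmetic, assuming a /24 netmask (my network's, not necessarily yours):

```shell
# bindnetaddr as a network address is the IP ANDed with the netmask.
# The /24 mask below is an assumption about my subnet.
ip=192.168.99.225
mask=255.255.255.0
oldIFS=$IFS; IFS=.
set -- $ip;   i1=$1 i2=$2 i3=$3 i4=$4
set -- $mask; m1=$1 m2=$2 m3=$3 m4=$4
IFS=$oldIFS
net="$(( i1 & m1 )).$(( i2 & m2 )).$(( i3 & m3 )).$(( i4 & m4 ))"
echo "$net"    # 192.168.99.0
```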

There seems to be a chicken-and-egg problem here, unless there's some way I'm 
not aware of to remove a node from a flat-file or SQLite db somewhere, or some 
other, more authoritative way to remove a node from a cluster.
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
