Re: [ClusterLabs] PAF not starting resource successfully after node reboot (was: How to set up fencing/stonith)

Casey & Gina Thu, 31 May 2018 09:21:01 -0700

> There is no "master node" in pacemaker. There are master/slave resources,
> so at best it is the "node on which a specific resource has the master
> role". And we have no way to know on which node your resource had the
> master role when you did it. Please be more specific; otherwise it is hard
> to impossible to follow.


Well, my limited understanding is that there should be one node that's the
master at any point in time.  I don't see how it makes sense to have resources
with masters on different nodes in the same cluster.  I'm being as specific as
I can given my limited knowledge.  I'm not a developer; just an admin trying to
get a simple cluster up and running.  Years ago, I did this same thing with two
nodes and heartbeat, and it was very easy.  Anyway, what I mean is that I
powered off the node that held the master role for all resources at the time.

> Not specifically related to your problem, but I wonder what the
> difference is. As far as I know, for master/slave "Started" == "Slave",
> so I'm surprised to see two different states listed here.

I also wondered about that, since from PostgreSQL's point of view there is one
master and two standbys, which are no different from one another.  But like you
said, it didn't seem relevant to my problem.

> Well, apparently the resource agent does not like a crashed instance.
> That is quite possible; I have been working with another replicated
> database where it was necessary to manually fix the configuration after
> failover, *outside* of pacemaker. Pacemaker simply failed to start a
> resource which had an unexpected state.

I can manually start up the database in standby mode, without any errors or
special intervention/fixing whatsoever, as long as the replication logs have
not gotten too far ahead on the new master.  If they have, I would need to
rebuild the standby.

> This needs someone familiar with this RA and application to answer.

The resource agent is PAF and I've seen a lot of others discussing this on this 
list, so I hope that I am asking in the right place.

> Note that this is not quite a normal use case. You explicitly disabled any
> handling by the RA, thus effectively not using pacemaker high availability
> at all. Does it fail over the master if you do not unmanage the resource
> and kill the node where the resource has the master role?

I was following the specific instructions in the E-mail I was replying to,
which asked me to unmanage the resource and try manual debugging steps.  As
I've discussed earlier in this thread (please review the previous E-mails for
further information), pacemaker does fail over the master.  But when the
former master node comes back online, if I run `pcs cluster start` on it
without first starting the database by hand, pacemaker fails to start the PAF
resource and ends up fencing the node again.

I've been told that what PAF does on resource startup is exactly the same as
the manual commands that I can run to make it work.  In the prior E-mails on
this thread, I was told that the resource startup fails because the resource
agent incorrectly determines that the resource is already running when it's
not - so it never even tries to start the resource.  The debug instructions
I'm following are meant to figure out which command it runs to determine this
state.  Failing over to another node is only half the battle - the failed node
should be able to rejoin the cluster without being fenced again immediately,
shouldn't it?
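
For anyone reading the output quoted below: as far as I can tell, the numbers
in it ("(9)" and "returned 5") are the standard OCF return codes.  Here is a
rough sketch of a decoder for them.  The function name ocf_rc_name is my own
invention and is not part of pacemaker or PAF; it only maps numbers to names.

```shell
# Hypothetical helper (not shipped with any package): map an OCF
# return code to its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}

# The two codes that appear in the crm_resource output.
for rc in 9 5; do
    echo "$rc = $(ocf_rc_name "$rc")"
done
```

If that mapping is right, the forced check is not reporting "not running" (7)
at all; it is reporting an installation problem (5), which fits the
ocf-exit-reason line about the Pacemaker version.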

>> ------
>> root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check 
>> warning: unpack_rsc_op_failure:        Processing failed op monitor for 
>> postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> warning: unpack_rsc_op_failure:        Processing failed op monitor for 
>> postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
>> Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
>>> stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match 
>>> (m//) at 
>>> /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 
>>> 392.
>>> stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and 
>>> greater
>> Error performing operation: Input/output error
>> ------
> 
> This looks like a bug in your version.

Version of what?  I'm using the corosync, pacemaker, and pcs versions as 
provided by Ubuntu (for version 16.04), and resource-agents-paf as provided by 
the PGDG repository.

These versions are as follows:
* corosync - 2.3.5-3ubuntu2
* pacemaker - 1.1.14-2ubuntu1.3
* pcs - 0.9.149-1ubuntu1.1
* resource-agents-paf - 2.2.0-2.pgdg16.04+1

These are the latest packaged versions available for my platform, as far as I'm
aware, and presumably the same ones other Ubuntu 16.04 users on this list are
running.
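
As a sanity check on the "compatible with Pacemaker 1.1.13 and greater"
message, here is a small sketch comparing the pacemaker version listed above
against that minimum.  It assumes the message refers to the Pacemaker package
version; I don't know what PAF actually compares internally.  The function
name version_ge is my own, and it relies on `sort -V` from GNU coreutils.

```shell
# version_ge A B: succeed if version A is greater than or equal to B.
# With version sort, B comes out first exactly when B <= A.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

installed="1.1.14-2ubuntu1.3"   # pacemaker package version from the list above
upstream="${installed%%-*}"     # drop the packaging revision, leaving 1.1.14
required="1.1.13"               # minimum named in the PAF error message

if version_ge "$upstream" "$required"; then
    echo "pacemaker $upstream satisfies >= $required"
else
    echo "pacemaker $upstream is older than $required"
fi
```

By that comparison 1.1.14 meets the stated minimum, so the message from the
forced check does not look like a plain "pacemaker is too old" situation.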

Regards,
-- 
Casey