Yes, I have tried that. I used crm_resource --meta -p resource-stickiness -v 0
-r SS16201289RN00023 to disable resource stickiness and then kill -9 <pid> to
kill the application associated with the master resource. The results are the
same: the slave resource remains a slave while the failed resource is
restarted and becomes master again.
One approach that seems to work is to run crm_resource -M -r
ms-SS16201289RN00023 -H mgraid-16201289RN00023-1 to move the resource to the
other node (assuming that the master is running on node
mgraid-16201289RN00023-0). I had originally understood that this would
"restart" the resource on the destination node, but that was apparently a
misunderstanding. I can change our scripts to use this approach, but a) I
thought that demoting the master resource and promoting the slave to master
was the more generic approach, and b) I am unsure of any potential side
effects of moving the resource. Given what I'm trying to accomplish, is
this in fact the preferred approach?
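For reference, here is a minimal sketch of the sequence I have in mind. It defaults to a dry run that only prints the commands. The helper names (run, move_master, clear_move) are made up for illustration, and the crm_resource -U step is an assumption on my part, based on my understanding that -M leaves a location constraint behind until it is cleared.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to run the commands for real on a cluster node.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Ask Pacemaker to move the master/slave resource $1 to node $2.
move_master() {
    run crm_resource -M -r "$1" -H "$2"
}

# Assumption: -M works by adding a location constraint, so clear it
# afterwards to let the cluster place the resource freely again.
clear_move() {
    run crm_resource -U -r "$1"
}

move_master ms-SS16201289RN00023 mgraid-16201289RN00023-1
clear_move ms-SS16201289RN00023
```

If the constraint is not cleared, I would expect the resource to stay pinned to the destination node, which is the kind of side effect I am asking about.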
Regards,
Michael
-----Original Message-----
From: Users <[email protected]> On Behalf Of
[email protected]
Sent: Monday, August 12, 2019 1:10 PM
To: [email protected]
Subject: [EXTERNAL] Users Digest, Vol 55, Issue 19
Today's Topics:
1. why is node fenced ? (Lentes, Bernd)
2. Postgres HA - pacemaker RA do not support auto failback (Shital A)
3. Re: why is node fenced ? (Chris Walker)
4. Re: Master/slave failover does not work as expected
(Andrei Borzenkov)
----------------------------------------------------------------------
Message: 1
Date: Mon, 12 Aug 2019 18:09:24 +0200 (CEST)
From: "Lentes, Bernd"
<[email protected]>
To: Pacemaker ML <[email protected]>
Subject: [ClusterLabs] why is node fenced ?
Message-ID:
<546330844.1686419.1565626164456.javamail.zim...@helmholtz-muenchen.de>
Content-Type: text/plain; charset=utf-8
Hi,
Last Friday (9th of August) I had to install patches on my two-node cluster.
I put one of the nodes (ha-idg-2) into standby (crm node standby ha-idg-2),
patched it, rebooted, started the cluster (systemctl start pacemaker) again,
and put the node online again; everything was fine.
Then I wanted to do the same procedure with the other node (ha-idg-1).
I put it in standby, patched it, rebooted, and started pacemaker again.
But then ha-idg-1 fenced ha-idg-2, saying the node was unclean.
I know that nodes which are unclean need to be shut down; that's logical.
But I don't know where the conclusion that the node is unclean comes from,
or why it is unclean. I searched the logs and didn't find any hint.
I put the syslog and the pacemaker log on a Seafile share; I'd be very thankful
if you could have a look.
https://hmgubox.helmholtz-muenchen.de/d/53a10960932445fb9cfe/
Here is the CLI history of the commands:
17:03:04 crm node standby ha-idg-2
17:07:15 zypper up (install Updates on ha-idg-2)
17:17:30 systemctl reboot
17:25:21 systemctl start pacemaker.service
17:25:47 crm node online ha-idg-2
17:26:35 crm node standby ha-idg-1
17:30:21 zypper up (install Updates on ha-idg-1)
17:37:32 systemctl reboot
17:43:04 systemctl start pacemaker.service
17:44:00 ha-idg-2 is fenced
Thanks.
Bernd
OS is SLES 12 SP4, pacemaker 1.1.19, corosync 2.3.6-9.13.1
--
Bernd Lentes
Systemadministration
Institut für Entwicklungsgenetik
Gebäude 35.34 - Raum 208
HelmholtzZentrum münchen
[email protected]
phone: +49 89 3187 1241
phone: +49 89 3187 3827
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/idg
Perfect is whoever makes no mistakes
So the dead are perfect
Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler,
Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671
------------------------------
Message: 2
Date: Mon, 12 Aug 2019 12:24:02 +0530
From: Shital A <[email protected]>
To: [email protected],
[email protected]
Subject: [ClusterLabs] Postgres HA - pacemaker RA do not support auto
failback
Message-ID:
<camp7vw_kf2em_buh_fpbznc9z6pvvx+7rxjymhfmcozxuwg...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hello,
Postgres version: 9.6
OS: RHEL 7.6
We are working on HA setup for postgres cluster of two nodes in
active-passive mode.
Installed:
Pacemaker 1.1.19
Corosync 2.4.3
The pacemaker agent with this installation doesn't support automatic
failback. What I mean by that is explained below:
1. The cluster is set up like A - B, with A as master.
2. Kill the services on A; node B will come up as master.
3. When node A is ready to rejoin the cluster, we have to delete the lock
file it creates on one of the nodes and execute the cleanup command to get
the node back as standby.
Step 3 is manual, so HA is not achieved in a real sense.
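To make step 3 concrete, the manual recovery currently looks roughly like the sketch below. It defaults to a dry run that only prints the commands. The lock file path is an assumption (what I believe is the pgsql agent's default location), the resource name pgsql-ha is made up, and the helper names are only for illustration.

```shell
#!/bin/sh
# Hypothetical values: adjust both to the actual configuration.
LOCK_FILE="${LOCK_FILE:-/var/lib/pgsql/tmp/PGSQL.lock}"  # assumed default path
RESOURCE="${RESOURCE:-pgsql-ha}"                          # made-up resource name

# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to run the commands for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 3: remove the lock file, then clean up the resource's failure
# history so the node can come back as standby.
rejoin_as_standby() {
    run rm -f "$LOCK_FILE"
    run crm_resource --cleanup -r "$RESOURCE"
}

rejoin_as_standby
```

What we are looking for is a way to avoid having to run these two commands by hand after every failover.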
Please help us check:
1. Is there any version of the resource agent which supports automatic
failback, so that the lock file is not generated and does not need deleting?
2. If there is no such support and we need this functionality, do we have
to modify the existing code?
Please suggest how this can be achieved.
Thanks.
------------------------------
Message: 3
Date: Mon, 12 Aug 2019 17:47:02 +0000
From: Chris Walker <[email protected]>
To: Cluster Labs - All topics related to open-source clustering
welcomed <[email protected]>
Subject: Re: [ClusterLabs] why is node fenced ?
Message-ID:
<[email protected]>
Content-Type: text/plain; charset="utf-8"
When ha-idg-1 started Pacemaker around 17:43, it did not see ha-idg-2; for
example:
Aug 09 17:43:05 [6318] ha-idg-1 pacemakerd: info: pcmk_quorum_notification:
Quorum retained | membership=1320 members=1
After ~20s (the dc-deadtime parameter), ha-idg-2 is marked 'unclean' and
STONITHed as part of startup fencing.
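If it helps, the member count can be pulled out of such log lines mechanically. This is only a small sketch: the function name is made up, and it assumes each notification sits on a single line in the actual log file (it is wrapped in this mail).

```shell
#!/bin/sh
# Read log lines on stdin and print the N from "members=N" for every
# pcmk_quorum_notification line (one number per matching line).
quorum_members() {
    sed -n '/pcmk_quorum_notification/s/.*members=\([0-9][0-9]*\).*/\1/p'
}

# Usage (the log path is an assumption, adjust as needed):
#   quorum_members < /var/log/pacemaker.log
```

On a two-node cluster, a value of 1 means the node only saw itself. If dc-deadtime has been set explicitly, crm_attribute --type crm_config --name dc-deadtime --query should show the configured value.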
There is nothing in ha-idg-2's HA logs around 17:43 indicating that it saw
ha-idg-1 either, so it appears that there was no communication at all between
the two nodes.
I'm not sure exactly why the nodes did not see one another, but there are
indications of network issues around this time:
2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: [ 1229.245533] bond1: now
running without any active interface!
so perhaps that's related.
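If you want to check the bond state directly, something like the sketch below may help. It assumes the bond runs in active-backup mode, where the kernel's status file under /proc/net/bonding reports a "Currently Active Slave" line; the function name is made up.

```shell
#!/bin/sh
# Print the currently active slave interface named in a bonding
# status file (for example /proc/net/bonding/bond1).
active_slave() {
    awk -F': ' '/^Currently Active Slave/ { print $2 }' "$1"
}

# Usage on the affected node:
#   active_slave /proc/net/bonding/bond1
```

A value of "None" there, or no output at all, would be worth comparing against the timestamps in the kernel log.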
HTH,
Chris
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
------------------------------
Message: 4
Date: Mon, 12 Aug 2019 23:09:31 +0300
From: Andrei Borzenkov <[email protected]>
To: Cluster Labs - All topics related to open-source clustering
welcomed <[email protected]>
Cc: Venkata Reddy Chappavarapu
<[email protected]>
Subject: Re: [ClusterLabs] Master/slave failover does not work as
expected
Message-ID:
<CAA91j0WxSxt_eVmUvXgJ_0goBkBw69r3o-VesRvGc6atg6o=j...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
On Mon, Aug 12, 2019 at 4:12 PM Michael Powell <
[email protected]> wrote:
> At 07:44:49, the ss agent discovers that the master instance has failed on
> node *mgraid...-0* as a result of a failed *ssadm* request in response to
> an *ss_monitor()* operation. It issues a *crm_master -Q -D* command with
> the intent of demoting the master and promoting the slave, on the other
> node, to master. The *ss_demote()* function finds that the application
> is no longer running and returns *OCF_NOT_RUNNING* (7). In the older
> product, this was sufficient to promote the other instance to master, but
> in the current product, that does not happen. Currently, the failed
> application is restarted, as expected, and is promoted to master, but this
> takes 10's of seconds.
Did you try to disable resource stickiness for this ms?
------------------------------
End of Users Digest, Vol 55, Issue 19
*************************************