>>> damiano giuliani <[email protected]> wrote on 30.06.2021 at 13:44 in message
<CAG=zYNNe=azzalehe3jzkahnsev88nr+yeo0m06hljl4l11...@mail.gmail.com>:
> Hi guys,
>
> Sorry for bothering you. Unfortunately, I was called about an issue with a
> cluster I built months ago, which was fully functional until last Saturday.
>
> It looks like some applications lost their connection to the master, losing
> some updates/inserts.
>
> I found the cause in the logs: the psqld-monitor timed out after 10000 ms
> and the master resource was demoted; the instance was stopped and then
> promoted to master again, causing a few seconds of service disruption (no
> master during the described process).

Well, I think YOU have to find out why the monitor timed out. Maybe the disks being used were too busy, maybe the memory was tight, ... WE don't know.

>
> I noticed a redundant message:
> Update score of "ltaoperdbsXX" from 990 to 1000 because of a change in the
> replication lag
> It seems to be some kind of network lag?
>
> The network should be 10 Gb/s, and both corosync and the production network
> run on it. Network bonding is configured on all of the nodes.
> PAF version resource-agents-paf-2.3.0-1.rhel7.noarch
> Postgres psql (13.1)
> pacemaker-1.1.23-1.el7.x86_64
> pcs-0.9.169-3.el7.centos.x86_64
>
> I attached the log, which could be useful for digging further.
> Could someone point me in the right direction? It would be really
> appreciated.
>
> Thanks for the support
> Pepe
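
To follow up on the suggestion to check whether the disks were busy or memory was tight, here is a minimal sketch that logs two rough pressure indicators from `/proc` on a Linux node. It is only an illustration: the function names (`parse_meminfo`, `available_fraction`, `parse_iowait`, `log_samples`) are hypothetical helpers, not part of Pacemaker or PAF, and the sampling interval is an arbitrary assumption. Established tools such as `sar` or `iostat` would give more detail.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict mapping field name to its integer value."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_fraction(info):
    """Return MemAvailable / MemTotal, or None if either field is missing."""
    total = info.get("MemTotal")
    avail = info.get("MemAvailable")
    if not total or avail is None:
        return None
    return avail / total


def parse_iowait(stat_text):
    """Return cumulative iowait ticks from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # Field order after the label: user nice system idle iowait ...
        if fields and fields[0] == "cpu" and len(fields) > 5:
            return int(fields[5])
    return None


def log_samples(count, interval=5.0):
    """Print `count` timestamped samples, `interval` seconds apart (Linux only)."""
    for i in range(count):
        with open("/proc/meminfo") as f:
            mem = parse_meminfo(f.read())
        with open("/proc/stat") as f:
            iowait = parse_iowait(f.read())
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, "mem_available_fraction:", available_fraction(mem),
              "iowait_ticks:", iowait, flush=True)
        if i < count - 1:
            time.sleep(interval)
```

Calling `log_samples(720, 5.0)` would record about an hour of samples. A falling available-memory fraction or a fast-growing iowait counter around the time of a monitor timeout would hint at resource pressure, though it would not prove the cause.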
