Hello,
Thanks for looking at this issue!
Snippets from /var/log/messages and /var/log/pacemaker.log are below.
_Vitaly

Here is /var/log/pacemaker.log snippet around the failure:
Jul 05 11:54:33 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:277) trace: Reading M_IP_monitor_10000 stdout into offset 177
Jul 05 11:54:34 tomcat-rhino(tomcat-instance)[2295103]: INFO: [tomcat] Leave tomcat start 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading tomcat-instance_start_0 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:277) trace: Reading tomcat-instance_start_0 stdout into offset 10505
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (log_finished@execd_commands.c:214) info: tomcat-instance start (call 59, PID 2295103) exited with status 0 (execution time 110997ms, queue time 0ms)
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (log_execute@execd_commands.c:232) info: executing - rsc:N1F1 action:start call_id:66
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (log_execute@execd_commands.c:232) info: executing - rsc:fs_monitor action:start call_id:67
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading fs_monitor_start_0 stdout into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:287) trace: Got 54 chars: 2298369 (process ID) old priority 0, new priority -10
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading N1F1_start_0 stdout into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:287) trace: Got 175 chars: 8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading tomcat-instance_monitor_10000 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading tomcat-instance_monitor_10000 stdout into offset 0
Jul 05 11:54:34 fs_monitor-rhino(fs_monitor)[2298359]: INFO: Started fs_monitor.sh, pid=2298369
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading fs_monitor_start_0 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:277) trace: Reading fs_monitor_start_0 stdout into offset 54
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (log_finished@execd_commands.c:214) info: fs_monitor start (call 67, PID 2298359) exited with status 0 (execution time 31ms, queue time 0ms)
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (log_execute@execd_commands.c:232) info: executing - rsc:ClusterMonitor action:start call_id:69
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading fs_monitor_monitor_10000 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading fs_monitor_monitor_10000 stdout into offset 0
Jul 05 11:54:34 IPaddr2-rhino(N1F1)[2298357]: INFO: Adding inet address 172.18.51.93/23 with broadcast address 172.18.51.255 to device bond0 (with label bond0:N1F1)
Jul 05 11:54:34 IPaddr2-rhino(N1F1)[2298357]: INFO: Bringing device bond0 up
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Jul 05 11:54:34 IPaddr2-rhino(N1F1)[2298357]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-172.18.51.93 bond0 172.18.51.93 auto not_used not_used
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading N1F1_start_0 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:277) trace: Reading N1F1_start_0 stdout into offset 175
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (log_finished@execd_commands.c:214) info: N1F1 start (call 66, PID 2298357) exited with status 0 (execution time 68ms, queue time 0ms)
Jul 05 11:54:34 cluster_monitor-rhino(ClusterMonitor)[2298481]: INFO: Started cluster_monitor.sh, pid=2298549
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (log_finished@execd_commands.c:214) info: ClusterMonitor start (call 69, PID 2298481) exited with status 0 (execution time 40ms, queue time 0ms)
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading ClusterMonitor_monitor_10000 stdout into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:287) trace: Got 8 chars: 2298549
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading N1F1_monitor_10000 stdout into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:287) trace: Got 175 chars: 8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading N1F1_monitor_10000 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:277) trace: Reading N1F1_monitor_10000 stdout into offset 175
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:277) trace: Reading ClusterMonitor_monitor_10000 stdout into offset 8
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:287) trace: Got 499 chars: root 2298549 0.0 0.0 127584 13156 ? S 11:54 0:00 /sbin/crm_mon
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:287) trace: Got 19 chars: ep cluster_monitor
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading ClusterMonitor_monitor_10000 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:277) trace: Reading ClusterMonitor_monitor_10000 stdout into offset 526
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading ethmonitor_monitor_10000 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] (svc_read_output@services_linux.c:280) trace: Reading ethmonitor_monitor_10000 stdout into offset 0
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Jul 05 11:54:34 pgsql-rhino(postgres)[2298353]: INFO: Changing pgsql-data-status on d19-25-left.lab.archivas.com : DISCONNECT->STREAMING|ASYNC.
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Jul 05 11:54:35 pgsql-rhino(postgres)[2298353]: INFO: Setup d19-25-left.lab.archivas.com into sync mode.

*********

Here is /var/log/messages snippet around the failure:
Jul 5 11:54:02 d19-25-right systemd[1]: session-1133.scope: Succeeded.
Jul 5 11:54:34 d19-25-right tomcat-rhino(tomcat-instance)[2295103]: INFO: [tomcat] Leave tomcat start 0
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of start operation for tomcat-instance on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of monitor operation for postgres on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of monitor operation for tomcat-instance on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of monitor operation for ethmonitor on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of start operation for N1F1 on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of start operation for fs_monitor on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right fs_monitor-rhino(fs_monitor)[2298359]: INFO: Started fs_monitor.sh, pid=2298369
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for tomcat-instance on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of start operation for fs_monitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of monitor operation for fs_monitor on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of start operation for ClusterMonitor on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for fs_monitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right IPaddr2-rhino(N1F1)[2298357]: INFO: Adding inet address 172.18.51.93/23 with broadcast address 172.18.51.255 to device bond0 (with label bond0:N1F1)
Jul 5 11:54:34 d19-25-right IPaddr2-rhino(N1F1)[2298357]: INFO: Bringing device bond0 up
Jul 5 11:54:34 d19-25-right IPaddr2-rhino(N1F1)[2298357]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-172.18.51.93 bond0 172.18.51.93 auto not_used not_used
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of start operation for N1F1 on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of monitor operation for N1F1 on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right cluster_monitor-rhino(ClusterMonitor)[2298481]: INFO: Started cluster_monitor.sh, pid=2298549
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of start operation for ClusterMonitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of monitor operation for ClusterMonitor on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for N1F1 on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for ClusterMonitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for ethmonitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pgsql-rhino(postgres)[2298353]: INFO: Changing pgsql-data-status on d19-25-left.lab.archivas.com : DISCONNECT->STREAMING|ASYNC.
Jul 5 11:54:35 d19-25-right pgsql-rhino(postgres)[2298353]: INFO: Setup d19-25-left.lab.archivas.com into sync mode.

> On 07/04/2022 3:57 PM Reid Wahl <[email protected]> wrote:
>
>
> On Mon, Jul 4, 2022 at 7:19 AM vitaly <[email protected]> wrote:
> >
> > I get printout of metadata as follows:
> > d19-25-left.lab.archivas.com ~ # OCF_ROOT=/usr/lib/ocf
> > /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data
> > <?xml version="1.0"?>
> > <!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
> > <resource-agent name="pgsql-rhino">
> > <version>1.0</version>
> >
> > <longdesc lang="en">
> > Resource script for PostgreSQL. It manages a PostgreSQL as an HA resource.
> > </longdesc>
> > <shortdesc lang="en">Manages a PostgreSQL database instance</shortdesc>
> >
> > <parameters>
> > <parameter name="pgctl" unique="0" required="0">
> > <longdesc lang="en">
> > Path to pg_ctl command.
> > </longdesc>
> > <shortdesc lang="en">pgctl</shortdesc>
> > <content type="string" default="/usr/bin/pg_ctl" />
> > </parameter>
> >
> > <parameter name="start_opt" unique="0" required="0">
> > <longdesc lang="en">
> > Start options (-o start_opt in pg_ctl). "-i -p 5432" for example.
> > </longdesc>
> > <shortdesc lang="en">start_opt</shortdesc>
> > <content type="string" default="" />
> >
> > </parameter>
> > <parameter name="ctl_opt" unique="0" required="0">
> > <longdesc lang="en">
> > Additional pg_ctl options (-w, -W etc..).
> > </longdesc>
> > <shortdesc lang="en">ctl_opt</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="psql" unique="0" required="0">
> > <longdesc lang="en">
> > Path to psql command.
> > </longdesc>
> > <shortdesc lang="en">psql</shortdesc>
> > <content type="string" default="/usr/bin/psql" />
> > </parameter>
> >
> > <parameter name="pgdata" unique="0" required="0">
> > <longdesc lang="en">
> > Path to PostgreSQL data directory.
> > </longdesc>
> > <shortdesc lang="en">pgdata</shortdesc>
> > <content type="string" default="/var/lib/pgsql/data" />
> > </parameter>
> >
> > <parameter name="pgdba" unique="0" required="0">
> > <longdesc lang="en">
> > User that owns PostgreSQL.
> > </longdesc>
> > <shortdesc lang="en">pgdba</shortdesc>
> > <content type="string" default="postgres" />
> > </parameter>
> >
> > <parameter name="pghost" unique="0" required="0">
> > <longdesc lang="en">
> > Hostname/IP address where PostgreSQL is listening
> > </longdesc>
> > <shortdesc lang="en">pghost</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="pgport" unique="0" required="0">
> > <longdesc lang="en">
> > Port where PostgreSQL is listening
> > </longdesc>
> > <shortdesc lang="en">pgport</shortdesc>
> > <content type="integer" default="5432" />
> > </parameter>
> >
> > <parameter name="monitor_user" unique="0" required="0">
> > <longdesc lang="en">
> > PostgreSQL user that pgsql RA will user for monitor operations. If it's not
> > set
> > pgdba user will be used.
> > </longdesc>
> > <shortdesc lang="en">monitor_user</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="monitor_password" unique="0" required="0">
> > <longdesc lang="en">
> > Password for monitor user.
> > </longdesc>
> > <shortdesc lang="en">monitor_password</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="monitor_sql" unique="0" required="0">
> > <longdesc lang="en">
> > SQL script that will be used for monitor operations.
> > </longdesc>
> > <shortdesc lang="en">monitor_sql</shortdesc>
> > <content type="string" default="select now();" />
> > </parameter>
> >
> > <parameter name="config" unique="0" required="0">
> > <longdesc lang="en">
> > Path to the PostgreSQL configuration file for the instance.
> > </longdesc>
> > <shortdesc lang="en">Configuration file</shortdesc>
> > <content type="string" default="/var/lib/pgsql/data/postgresql.conf" />
> > </parameter>
> >
> > <parameter name="pgdb" unique="0" required="0">
> > <longdesc lang="en">
> > Database that will be used for monitoring.
> > </longdesc>
> > <shortdesc lang="en">pgdb</shortdesc>
> > <content type="string" default="rhinodb" />
> > </parameter>
> >
> > <parameter name="logfile" unique="0" required="0">
> > <longdesc lang="en">
> > Path to PostgreSQL server log output file.
> > </longdesc>
> > <shortdesc lang="en">logfile</shortdesc>
> > <content type="string" default="/dev/null" />
> > </parameter>
> >
> > <parameter name="socketdir" unique="0" required="0">
> > <longdesc lang="en">
> > Unix socket directory for PostgeSQL
> > </longdesc>
> > <shortdesc lang="en">socketdir</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="stop_escalate" unique="0" required="0">
> > <longdesc lang="en">
> > Number of shutdown retries (using -m fast) before resorting to -m immediate
> > </longdesc>
> > <shortdesc lang="en">stop escalation</shortdesc>
> > <content type="integer" default="30" />
> > </parameter>
> >
> > <parameter name="rep_mode" unique="0" required="0">
> > <longdesc lang="en">
> > Replication mode(none(default)/async/sync).
> > "async" and "sync" require PostgreSQL 9.1 or later.
> > If you use async or sync, it requires node_list, master_ip, restore_command
> > parameters, and needs setting postgresql.conf, pg_hba.conf up for
> > replication.
> > Please delete "include /../../rep_mode.conf" line in postgresql.conf
> > when you switch from sync to async.
> > </longdesc>
> > <shortdesc lang="en">rep_mode</shortdesc>
> > <content type="string" default="none" />
> > </parameter>
> >
> > <parameter name="node_list" unique="0" required="0">
> > <longdesc lang="en">
> > All node names. Please separate each node name with a space.
> > This is required for replication.
> > </longdesc>
> > <shortdesc lang="en">node list</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="restore_command" unique="0" required="0">
> > <longdesc lang="en">
> > restore_command for recovery.conf.
> > This is required for replication.
> > </longdesc>
> > <shortdesc lang="en">restore_command</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="master_ip" unique="0" required="0">
> > <longdesc lang="en">
> > Master's floating IP address to be connected from hot standby.
> > This parameter is used for "primary_conninfo" in recovery.conf.
> > This is required for replication.
> > </longdesc>
> > <shortdesc lang="en">master ip</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="repuser" unique="0" required="0">
> > <longdesc lang="en">
> > User used to connect to the master server.
> > This parameter is used for "primary_conninfo" in recovery.conf.
> > This is required for replication.
> > </longdesc>
> > <shortdesc lang="en">repuser</shortdesc>
> > <content type="string" default="postgres" />
> > </parameter>
> >
> > <parameter name="remote_wals_dir" unique="0" required="1">
> > <longdesc lang="en">
> > Location of WALS archived by the other node
> > </longdesc>
> > <shortdesc lang="en">remote_wals_dir</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="xlogs_dir" unique="0" required="1">
> > <longdesc lang="en">
> > Location of WALS on current node in Rhino before 2.2.0
> > </longdesc>
> > <shortdesc lang="en">xlogs_dir</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="wals_dir" unique="0" required="1">
> > <longdesc lang="en">
> > Location of WALS on current node in Rhino 2.2.0 and later
> > </longdesc>
> > <shortdesc lang="en">wals_dir</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="reppassword" unique="0" required="0">
> > <longdesc lang="en">
> > User used to connect to the master server.
> > This parameter is used for "primary_conninfo" in recovery.conf.
> > This is required for replication.
> > </longdesc>
> > <shortdesc lang="en">reppassword</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="primary_conninfo_opt" unique="0" required="0">
> > <longdesc lang="en">
> > primary_conninfo options of recovery.conf except host, port, user and
> > application_name.
> > This is optional for replication.
> > </longdesc>
> > <shortdesc lang="en">primary_conninfo_opt</shortdesc>
> > <content type="string" default="" />
> > </parameter>
> >
> > <parameter name="tmpdir" unique="0" required="0">
> > <longdesc lang="en">
> > Path to temporary directory.
> > This is optional for replication.
> > </longdesc>
> > <shortdesc lang="en">tmpdir</shortdesc>
> > <content type="string" default="/var/lib/pgsql/tmp" />
> > </parameter>
> >
> > <parameter name="xlog_check_count" unique="0" required="0">
> > <longdesc lang="en">
> > Number of checking xlog on monitor before promote.
> > This is optional for replication.
> > </longdesc>
> > <shortdesc lang="en">xlog check count</shortdesc>
> > <content type="integer" default="" />
> > </parameter>
> >
> > <parameter name="crm_attr_timeout" unique="0" required="0">
> > <longdesc lang="en">
> > The timeout of crm_attribute forever update command.
> > Default value is 5 seconds.
> > This is optional for replication.
> > </longdesc>
> > <shortdesc lang="en">The timeout of crm_attribute forever update
> > command.</shortdesc>
> > <content type="integer" default="5" />
> > </parameter>
> >
> > <parameter name="stop_escalate_in_slave" unique="0" required="0">
> > <longdesc lang="en">
> > Number of shutdown retries (using -m fast) before resorting to -m immediate
> > in Slave state.
> > This is optional for replication.
> > </longdesc>
> > <shortdesc lang="en">stop escalation_in_slave</shortdesc>
> > <content type="integer" default="30" />
> > </parameter>
> >
> > <parameter name="process_start_timeout" unique="0" required="0">
> > <longdesc lang="en">
> > Number of seconds to wait for a postgreSQL process to be running but not
> > necessarilly usable
> > </longdesc>
> > <shortdesc lang="en">Seconds to wait for a process to be running</shortdesc>
> > <content type="integer" default="30" />
> > </parameter>
> >
> > <parameter name="start_attempts_force_recover" unique="0" required="0">
> > <longdesc lang="en">
> > Number of failed starts before the system forces a recovery from the master
> > database
> > </longdesc>
> > <shortdesc lang="en">Start failures before recovery</shortdesc>
> > <content type="integer" default="20" />
> > </parameter>
> >
> > <parameter name="rhino_config_file" unique="0" required="0">
> > <longdesc lang="en">
> > Configuration file with overrides for pgsql-rhino.
> > </longdesc>
> > <shortdesc lang="en">Rhino configuration file</shortdesc>
> > <content type="string" default="/opt/rhino/config/pgsql-rhino.conf" />
> > </parameter>
> >
> > </parameters>
> >
> > <actions>
> > <action name="start" timeout="120" />
> > <action name="stop" timeout="120" />
> > <action name="status" timeout="60" />
> > <action name="monitor" depth="0" timeout="30" interval="30"/>
> > <action name="monitor" depth="0" timeout="30" interval="29" role="Master" />
> > <action name="promote" timeout="120" />
> > <action name="demote" timeout="120" />
> > <action name="notify" timeout="90" />
> > <action name="meta-data" timeout="5" />
> > <action name="validate-all" timeout="5" />
> > <action name="methods" timeout="5" />
> > </actions>
> > </resource-agent>
>
> Hmm, seems reasonable. No permissions issues, and it looks like we
> should only print the "Failed to receive" message if we don't receive
> any stdout at all from the meta-data action.
>
> Can you add the following to /etc/sysconfig/pacemaker and restart
> pacemaker? Then monitor /var/log/pacemaker/pacemaker.log for relevant
> trace-level messages around the same time as the "Failed to receive
> meta-data" messages.
>
> PCMK_trace_functions=services_action_sync,svc_read_output
>
> This will get fairly verbose if you have more than a couple of
> resources, so after you've grabbed any relevant logs, comment that
> line out and restart pacemaker again.
>
> >
> > > On 07/04/2022 5:39 AM Reid Wahl <[email protected]> wrote:
> > >
> > >
> > > On Mon, Jul 4, 2022 at 1:06 AM Reid Wahl <[email protected]> wrote:
> > > >
> > > > On Sat, Jul 2, 2022 at 1:12 PM vitaly <[email protected]> wrote:
> > > > >
> > > > > Sorry, I noticed that I am missing meta "notice=true" and after
> > > > > adding it to postgres-ms configuration "notice" events started to
> > > > > come through.
> > > > > Item 1 still needs explanation. As pacemaker-controld keeps
> > > > > complaining.
> > > >
> > > > What happens when you run `OCF_ROOT=/usr/lib/ocf
> > > > /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data`?
> > >
> > > This may also be relevant:
> > > https://lists.clusterlabs.org/pipermail/users/2022-June/030391.html
> > >
> > > >
> > > > > Thanks!
> > > > > _Vitaly
> > > > >
> > > > > > On 07/02/2022 2:04 PM vitaly <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > > Hello Everybody.
> > > > > > I have a 2 node cluster with clone resource “postgres-ms”. We are
> > > > > > running following versions of pacemaker/corosync:
> > > > > > d19-25-left.lab.archivas.com ~ # rpm -qa | grep
> > > > > > "pacemaker\|corosync"
> > > > > > pacemaker-cluster-libs-2.0.5-9.el8.x86_64
> > > > > > pacemaker-libs-2.0.5-9.el8.x86_64
> > > > > > pacemaker-cli-2.0.5-9.el8.x86_64
> > > > > > corosynclib-3.1.0-5.el8.x86_64
> > > > > > pacemaker-schemas-2.0.5-9.el8.noarch
> > > > > > corosync-3.1.0-5.el8.x86_64
> > > > > > pacemaker-2.0.5-9.el8.x86_64
> > > > > >
> > > > > > There are couple of issues that could be related.
> > > > > > 1. There are following messages in the logs coming from
> > > > > > pacemaker-controld:
> > > > > > Jul 2 14:59:27 d19-25-right pacemaker-controld[1489734]: error:
> > > > > > Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > > > > > Jul 2 14:59:27 d19-25-right pacemaker-controld[1489734]: warning:
> > > > > > Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > > > > >
> > > > > > 2. ocf:heartbeat:pgsql-rhino does not get any "notice" operations
> > > > > > which causes multiple issues with postgres synchronization during
> > > > > > availability events.
> > > > > >
> > > > > > 3. Item 2 raises another question. Who is setting these values:
> > > > > > ${OCF_RESKEY_CRM_meta_notify_type}
> > > > > > ${OCF_RESKEY_CRM_meta_notify_operation}
> > > > > >
> > > > > > Here is excerpt from cluster config:
> > > > > >
> > > > > > d19-25-left.lab.archivas.com ~ # pcs config
> > > > > >
> > > > > > Cluster Name:
> > > > > > Corosync Nodes:
> > > > > > d19-25-right.lab.archivas.com d19-25-left.lab.archivas.com
> > > > > > Pacemaker Nodes:
> > > > > > d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com
> > > > > >
> > > > > > Resources:
> > > > > > Clone: postgres-ms
> > > > > > Meta Attrs: promotable=true target-role=started
> > > > > > Resource: postgres (class=ocf provider=heartbeat type=pgsql-rhino)
> > > > > > Attributes: master_ip=172.16.1.6
> > > > > > node_list="d19-25-left.lab.archivas.com
> > > > > > d19-25-right.lab.archivas.com" pgdata=/pg_data
> > > > > > remote_wals_dir=/remote/walarchive rep_mode=sync reppassword=XXXXXX
> > > > > > repuser=XXXXXXX
> > > > > > restore_command="/opt/rhino/sil/bin/script_wrapper.sh
> > > > > > wal_restore.py %f %p" tmpdir=/pg_data/tmp wals_dir=/pg_data/pg_wal
> > > > > > xlogs_dir=/pg_data/pg_xlog
> > > > > > Meta Attrs: is-managed=true
> > > > > > Operations: demote interval=0 on-fail=restart timeout=120s
> > > > > > (postgres-demote-interval-0)
> > > > > > methods interval=0s timeout=5
> > > > > > (postgres-methods-interval-0s)
> > > > > > monitor interval=10s on-fail=restart timeout=300s
> > > > > > (postgres-monitor-interval-10s)
> > > > > > monitor interval=5s on-fail=restart role=Master
> > > > > > timeout=300s (postgres-monitor-interval-5s)
> > > > > > notify interval=0 on-fail=restart timeout=90s
> > > > > > (postgres-notify-interval-0)
> > > > > > promote interval=0 on-fail=restart timeout=120s
> > > > > > (postgres-promote-interval-0)
> > > > > > start interval=0 on-fail=restart timeout=1800s
> > > > > > (postgres-start-interval-0)
> > > > > > stop interval=0 on-fail=fence timeout=120s
> > > > > > (postgres-stop-interval-0)
> > > > > > Thank you very much!
> > > > > > _Vitaly
> > > > > > _______________________________________________
> > > > > > Manage your subscription:
> > > > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > > > >
> > > > > > ClusterLabs home: https://www.clusterlabs.org/
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Reid Wahl (He/Him), RHCA
> > > > Senior Software Maintenance Engineer, Red Hat
> > > > CEE - Platform Support Delivery - ClusterHA
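
For anyone reproducing the manual check discussed in this thread (running the agent's meta-data action by hand and seeing whether anything arrives on stdout), here is a minimal Python sketch. It is not Pacemaker's code; it only mirrors Reid's description that the "Failed to receive" message should appear when no stdout comes back at all. The function names, the returned labels, and the 5-second default (borrowed from the `meta-data` action timeout in the agent's own XML) are assumptions made for illustration.

```python
import os
import subprocess
import xml.etree.ElementTree as ET


def classify_metadata(stdout: str) -> str:
    """Classify the stdout of a resource agent's meta-data action.

    The labels are hypothetical; they are not Pacemaker log messages.
    """
    text = stdout.strip()
    if not text:
        # Corresponds to the "no stdout at all" case described in the thread.
        return "empty"
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "invalid-xml"
    if root.tag != "resource-agent":
        return "unexpected-root"
    return "ok"


def run_meta_data(agent_path: str,
                  ocf_root: str = "/usr/lib/ocf",
                  timeout: float = 5.0) -> str:
    """Run `<agent> meta-data` with OCF_ROOT set and classify the result."""
    env = dict(os.environ, OCF_ROOT=ocf_root)
    try:
        proc = subprocess.run(
            [agent_path, "meta-data"],
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The agent did not finish within the assumed time limit.
        return "timeout"
    return classify_metadata(proc.stdout)
```

On a cluster node this could be called as `run_meta_data("/usr/lib/ocf/resource.d/heartbeat/pgsql-rhino")`. Running it repeatedly while other resources are starting may help show whether the agent itself ever returns nothing or takes too long, as opposed to the output being lost further along.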
