Re: [pgpool-general: 9106] Postgres/pgpool HA failover process
BTW, here is the network configuration we set:

    AUTOCONNECT_PRIORITY=120
    BOOTPROTO=dhcp
    DEVICE=eth0
    DHCPV6C=yes
    HWADDR=
    IPV6INIT=yes
    ONBOOT=yes
    TYPE=Ethernet
    USERCTL=no

Regards
Mukesh Tanuku

On Wed, May 29, 2024 at 12:53 PM Mukesh Tanuku wrote:
> Thanks Bo for your response to my email.
>
> Below are the parameters:
>
>     if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev enp0s8 label enp0s8:0'
>     if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev enp0s8'
>     arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I enp0s8'
>
> The above commands work on the leader node, and once the VIP is up
> successfully, PCP commands via the VIP host work, but only locally from
> the leader node. If we try to execute the PCP commands via the VIP from a
> remote host, it throws an error:
>
>     ERROR: connection to host "" failed with error "No route to host"
>
> What's the possible cause in this case?
>
> Regards
> Mukesh Tanuku
>
> On Tue, May 28, 2024 at 5:30 AM Bo Peng wrote:
>> Hi,
>>
>>> Hello Everyone,
>>>
>>> We are working on setting up a high-availability solution for Postgres
>>> and running some test cases. Based on that, we need some clarification
>>> on the HA setup, which follows the configuration given in this example:
>>> <https://www.pgpool.net/docs/42/en/html/example-cluster.html>
>>
>> If you want to configure a new cluster,
>> we recommend using the latest version:
>>
>> https://www.pgpool.net/docs/latest/en/html/example-cluster.html
>>
>>> Question - 01
>>> Regarding the active client connections that connect via the delegate
>>> VIP: what happens to those connections during the failover process? If
>>> the primary DB goes down (gracefully or by a crash), do the active
>>> connections hang until the new primary DB is available, or do they get
>>> terminated?
>>
>> The existing connections will be disconnected if the primary DB is down.
>> However, subsequent connections will be routed to the new primary.
>>
>>> Is the same true for the leader pgpool service as well?
>>
>> Yes.
>>
>>> Question - 02
>>> Configuring the delegate IP:
>>> a. Are there any standards/limitations for choosing the unused IP as the
>>> delegate IP? For example, must the IP be within the same subnet range as
>>> all the nodes in the cluster?
>>
>> There are no limitations in Pgpool-II; it depends on the OS.
>> Pgpool-II uses the "ip addr ..." command to assign or release the delegate IP.
>>
>> Below are the parameters:
>>
>>     if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev enp0s8 label enp0s8:0'
>>     if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev enp0s8'
>>     arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I enp0s8'
>>
>> You need to make sure the commands work.
>>
>>> b. Are there any challenges for a non-root user working with the VIP?
>>> We run pgpool as a non-root user.
>>
>> You need to make sure the non-root user can execute the ip and arping
>> commands with sudo without a password.
>>
>> --
>> Bo Peng
>> SRA OSS LLC
>> TEL: 03-5979-2701 FAX: 03-5979-2702
>> URL: https://www.sraoss.co.jp/
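A note on Bo's last point, since it is the usual sticking point in practice: the watchdog escalation runs as the pgpool OS user, so that user needs passwordless sudo for exactly the binaries named in if_up_cmd, if_down_cmd, and arping_cmd. A minimal sketch of such a sudoers entry, assuming the pgpool service runs as the postgres user (the user name, file name, and paths are illustrative; adjust to your system and validate any sudoers change with visudo):

    # /etc/sudoers.d/pgpool -- hypothetical example
    # Let the pgpool OS user manage the delegate IP without a password prompt.
    postgres ALL=(root) NOPASSWD: /sbin/ip, /usr/sbin/arping

As for the remote "No route to host": that error comes from the network layer rather than from pgpool itself, so a firewall rule (or cloud security group) and the ARP visibility of the VIP on the remote side are worth checking before the pgpool configuration.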
Pgpool delegate IP is not reachable from the remote host
Hello everyone.

We have a pgpool & Postgres setup made as per the given example:
<https://www.pgpool.net/docs/latest/en/html/example-cluster.html>

We haven't configured if_up_cmd, if_down_cmd & arping_cmd in pgpool.conf (we commented them out on purpose). We enabled delegate_ip and assigned an unused IP. While bringing up pgpool, the log shows the VIP is successfully brought up via if_up_cmd. How is this happening? Is there a default if_up_cmd that is used to bring up the VIP?

    2024-06-18 08:27:02.735: watchdog_utility pid 102493: LOG: successfully acquired the delegate IP:"10.127.1.20"
    2024-06-18 08:27:02.735: watchdog_utility pid 102493: DETAIL: 'if_up_cmd' returned with success
    2024-06-18 08:27:02.735: watchdog pid 102378: LOG: watchdog escalation process with pid: 102493 exit with SUCCESS.

NOTE: We ran into a situation where we disabled the if_* commands in the configuration and started pgpool, but the VIP still gets acquired by the leader pgpool node even though the if_* commands are not set in the pgpool.conf file. Can anyone explain how the VIP works in the pgpool watchdog?

Thank you

Regards
Mukesh Tanuku
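The likely explanation: commenting a parameter out of pgpool.conf does not disable it; pgpool falls back to that parameter's built-in default. Per the Pgpool-II documentation, the if_* commands ship with defaults along these lines (quoted here from memory of the docs, so verify against the manual for your exact version; the interface name in the defaults may differ from your system's):

    if_up_cmd = '/sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'
    if_down_cmd = '/sbin/ip addr del $_IP_$/24 dev eth0'
    arping_cmd = '/usr/sbin/arping -U $_IP_$ -w 1 -I eth0'

So as long as delegate_ip is set, the watchdog still runs these defaults on escalation, which matches the "'if_up_cmd' returned with success" line in the log above. To truly disable VIP handling, leave delegate_ip empty rather than commenting out the if_* commands.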
Query 2 Node HA test case result
Hello everyone,

We are doing a POC on a Postgres HA setup with streaming replication (async), using Pgpool-II for load balancing & connection pooling and repmgr for HA & automatic failover.

We are applying a test case: isolating the VM1 node from the network completely for more than 2 minutes and then plugging the network back in, since we want to verify how the system behaves during network glitches and whether there is any chance of split-brain.

Our current setup: 2 VMs on Azure cloud, each VM running Postgres along with the pgpool service. We enabled watchdog and assigned a delegate IP.

NOTE: due to some limitations, we are using a floating IP as the delegate IP.

During the test, here are our observations:
1. Client connections hung from the time VM1 was lost from the network until VM1 came back.
2. Once VM1 was lost, pgpool promoted VM2 as the LEADER node and the Postgres standby on VM2 was promoted to primary as well, but client connections still did not reach the new primary. Why is this not happening?
3. Once VM1 was back on the network, there was a split-brain situation where pgpool on VM1 took the lead to become the LEADER node (per pgpool.log), and from then on clients connect to the VM1 node via the VIP.

pgpool.conf:

    sr_check_period = 10            # seconds
    health_check_period = 30        # seconds
    health_check_timeout = 20       # seconds
    health_check_max_retries = 3
    health_check_retry_delay = 1
    wd_lifecheck_method = 'heartbeat'
    wd_interval = 10
    wd_heartbeat_keepalive = 2
    wd_heartbeat_deadtime = 30

Logs information (from VM2, pgpool.log):

14:30:17 network disconnected.

After 10 sec the streaming replication check failed and timed out:

    2024-07-03 14:30:26.176: sr_check_worker pid 58187: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

Then the pgpool health check failed, timing out as per health_check_timeout = 20 sec:

    2024-07-03 14:30:35.869: health_check0 pid 58188: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

Retrying health_check & sr_check, but timed out again:

    2024-07-03 14:30:46.187: sr_check_worker pid 58187: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out
    2024-07-03 14:30:46.880: health_check0 pid 58188: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

Watchdog received a message saying the leader node is lost:

    2024-07-03 14:30:47.192: watchdog pid 58151: WARNING: we have not received a beacon message from leader node "staging-ha0001: Linux staging-ha0001"
    2024-07-03 14:30:47.192: watchdog pid 58151: DETAIL: requesting info message from leader node
    2024-07-03 14:30:54.312: watchdog pid 58151: LOG: read from socket failed, remote end closed the connection
    2024-07-03 14:30:54.312: watchdog pid 58151: LOG: client socket of staging-ha0001: Linux staging-ha0001 is closed
    2024-07-03 14:30:54.313: watchdog pid 58151: LOG: remote node "staging-ha0001: Linux staging-ha0001" is reporting that it has lost us
    2024-07-03 14:30:54.313: watchdog pid 58151: LOG: we are lost on the leader node "staging-ha0001: Linux staging-ha0001"

Retrying health_check & sr_check, but timed out again:

    2024-07-03 14:30:57.888: health_check0 pid 58188: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out
    2024-07-03 14:30:57.888: health_check0 pid 58188: LOG: health check retrying on DB node: 0 (round:3)
    2024-07-03 14:31:06.201: sr_check_worker pid 58187: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

About 10 sec after we lost the leader node, the watchdog changed the current node to LEADER:

    2024-07-03 14:31:04.199: watchdog pid 58151: LOG: watchdog node state changed from [STANDING FOR LEADER] to [LEADER]

The health check failed on node 0, a degenerate request was received for node 0, and the pgpool main process started quarantining staging-ha0001(5432) (shutting down):

    2024-07-03 14:31:08.202: watchdog pid 58151: LOG: setting the local node "staging-ha0002: Linux staging-ha0002" as watchdog cluster leader
    2024-07-03 14:31:08.202: watchdog pid 58151: LOG: signal_user1_to_parent_with_reason(1)
    2024-07-03 14:31:08.202: watchdog pid 58151: LOG: I am the cluster leader node but we do not have enough nodes in cluster
    2024-07-03 14:31:08.202: watchdog pid 58151: DETAIL: waiting for the quorum to start escalation process
    2024-07-03 14:31:08.202: main pid 58147: LOG: Pgpool-II parent process received SIGUSR1
    2024-07-03 14:31:08.202: main pid 58147: LOG: Pgpool-II parent process received watchdog state change signal from watchdog
    2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out
    2024-07-03 14:31:08.899: health_check0 pid 58188: LOG: health check failed on node 0 (timeout:0)
    2024-07-03 14:31:08.899: health_check0
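The "waiting for the quorum to start escalation process" line looks like the key detail: with only two watchdog nodes, the survivor holds exactly half the votes, so by default it cannot claim quorum and never escalates, i.e. it never acquires the delegate IP. The logs suggest this is why clients connecting through the VIP still could not reach the new primary (observation 2). One hedged mitigation sketch for a two-node cluster, using the Pgpool-II 4.1+ parameter intended for exactly this topology (a third watchdog/witness node on a separate host is generally the more robust fix):

    # pgpool.conf -- allow a 2-node watchdog cluster to reach consensus
    # when exactly half of the nodes (one of two) are alive.
    enable_consensus_with_half_votes = on

Note the trade-off: with half-vote consensus a network partition lets each side consider itself the leader, which is the split-brain seen in observation 3, so the witness-node route avoids the problem more cleanly.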
Replication lag in Postgres
Hello everyone. First of all, thanks to the community members who are addressing all the queries that get posted; those give us more insight into issues and doubts around Postgres.

I have a question about a Postgres HA setup. We are setting up a 2-node Postgres cluster with async streaming replication, and we want to define an RPO (recovery point objective) in case of primary failure.

How can we best define the RPO in this setup? Since it's an async streaming replication setup, there is a chance of data loss proportional to the replication delay.

Is there any way we can configure the delay duration, for example to make sure the standby syncs with the primary every 10 minutes?

Thank you
Regards
Mukesh T
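For context on measuring that window: a common way to observe how far a standby is behind is to query pg_stat_replication on the primary. A minimal sketch (the lag columns exist since PostgreSQL 10):

    -- run on the primary: per-standby lag in WAL bytes and in time
    SELECT application_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
           write_lag,
           flush_lag,
           replay_lag
    FROM pg_stat_replication;

Sampling this over time gives an empirical basis for the RPO you can actually promise.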
Re: Replication lag in Postgres
Thank you for the information, Laurenz Albe.

On Fri, Jul 12, 2024 at 9:13 PM Laurenz Albe wrote:
> On Fri, 2024-07-12 at 20:41 +0530, Mukesh Tanuku wrote:
>> I have a question with postgres HA setup.
>> We are setting up a 2 node postgres cluster with async streaming replication, we want to
>> define a RPO (Recovery point objective) in case of primary failure.
>>
>> How can we best define the RPO in this setup? since it's an async streaming replication
>> setup there might be a chance of data loss which is proportional to the replication delay.
>>
>> Is there any way we can configure the delay duration, like for example to make sure every
>> 10 mins the standby sync has to happen with primary?
>
> When there is a delay, it is usually because replay at the standby is delayed.
> The WAL information is still replicated. You won't lose that information on
> failover; it will just make the failover take longer.
>
> Unless you have a network problem, you should never lose more than a fraction
> of a second.
>
> Yours,
> Laurenz Albe
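Laurenz's distinction between WAL that has been received and WAL that has been replayed can be checked directly on the standby; a small sketch:

    -- run on the standby: WAL received from the primary vs. WAL already replayed
    SELECT pg_last_wal_receive_lsn() AS received,
           pg_last_wal_replay_lsn() AS replayed,
           pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                           pg_last_wal_replay_lsn()) AS replay_backlog_bytes;

A large backlog lengthens promotion time but is not lost data, which is exactly the point made above.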
repmgr 5.4.1 dependent package on RHEL 9
Hello everyone,

Question: We are installing repmgr 5.4.1 along with Postgres 15.6 on RHEL 9 and facing an issue with the dependent development package json-c-devel-0.14-11.el9.x86_64. This package is not available in the RHEL repositories, so we have to enable the EPEL repo (EPEL is a community-supported repository hosted by the Fedora community) to install the development package on RHEL 9. There are some additional risks in going with this process. So what is a risk-free way to use repmgr on RHEL 9?

Thanks to the community for the active & prompt responses.

Regards
Mukesh T
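One lower-risk route worth checking first (an assumption to verify, not a confirmed fix: json-c-devel is typically shipped in Red Hat's CodeReady Linux Builder repository, which is maintained by Red Hat itself, unlike EPEL):

    # enable the Red Hat-supported CodeReady Linux Builder (CRB) repo on RHEL 9
    sudo subscription-manager repos --enable codeready-builder-for-rhel-9-$(arch)-rpms
    sudo dnf install json-c-devel

If the package resolves from CRB, EPEL is not needed for this dependency at all.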
PG-15.6: timeout parameters erroring out
Hello team,

We enabled the Postgres timeout parameters in the postgresql.conf file:

    idle_in_transaction_session_timeout = '1min'
    idle_session_timeout = '5min'

We also tried the other way, like below:

    idle_in_transaction_session_timeout = 6
    idle_session_timeout = 30

We see these errors reported in the logs after we enabled those parameters:

log:

    2025-02-10 04:17:19.156 GMT [2467573] ERROR: trailing junk after numeric literal at or near "1m" at character 43
    2025-02-10 04:17:19.156 GMT [2467573] STATEMENT: SET idle_in_transaction_session_timeout = 1min
    2025-02-10 04:17:19.845 GMT [2467575] ERROR: trailing junk after numeric literal at or near "1m" at character 43
    2025-02-10 04:17:19.845 GMT [2467575] STATEMENT: SET idle_in_transaction_session_timeout = 1min
    ...
    2025-02-10 04:22:00.823 GMT [2468337] ERROR: trailing junk after numeric literal at or near "5m" at character 28
    2025-02-10 04:22:00.823 GMT [2468337] STATEMENT: SET idle_session_timeout = 5min
    2025-02-10 04:22:03.487 GMT [2468342] ERROR: trailing junk after numeric literal at or near "5m" at character 28
    2025-02-10 04:22:03.487 GMT [2468342] STATEMENT: SET idle_session_timeout = 5min
    2025-02-10 04:22:04.470 GMT [2468345] ERROR: trailing junk after numeric literal at or near "5m" at character 28

Is there any known issue/bug with these parameters? Can someone help me get this working, since our app's idle connections are piling up and reaching max_connections? I really appreciate your help. Thank you.

Regards
Mukesh Tanuku
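For reference, the logged errors come from SET statements sent by a client, not from the config file itself. In SQL, a value carrying a unit must be quoted; a quick illustration in psql:

    -- fails: without quotes the parser sees a malformed numeric literal
    SET idle_in_transaction_session_timeout = 1min;

    -- works: quote the value when it carries a unit ...
    SET idle_in_transaction_session_timeout = '1min';

    -- ... or use a bare integer, which this parameter reads as milliseconds
    SET idle_in_transaction_session_timeout = 60000;

The replies below reach the same conclusion: a client application was issuing the unquoted SET.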
Re: PG-15.6: timeout parameters erroring out
Thanks for your quick responses.

I have just added the session-related timeout parameters in the postgresql.conf file and also tried tweaking with different values/patterns like ('1min', '1 min', 8, 10s, '10s', '10 s'), but I get an immediate error after reloading the configuration file.

postgres.log file:

    2025-02-10 10:27:10.748 GMT [934108] LOG: received SIGHUP, reloading configuration files
    2025-02-10 10:27:10.749 GMT [934108] LOG: parameter "idle_in_transaction_session_timeout" changed to "8"
    2025-02-10 10:27:16.117 GMT [2531150] ERROR: trailing junk after numeric literal at or near "80s" at character 43
    2025-02-10 10:27:16.117 GMT [2531150] STATEMENT: SET idle_in_transaction_session_timeout = 80s

Though we are not firing any SQL statements explicitly, we see this error. I guess something other than spacing is the issue.

Regards
Mukesh Tanuku

On Mon, Feb 10, 2025 at 10:38 AM David G. Johnston <david.g.johns...@gmail.com> wrote:
> On Sunday, February 9, 2025, Tom Lane wrote:
>> Mukesh Tanuku writes:
>>> We enabled the postgres timeout parameters in the postgresql.conf file
>>>
>>>     idle_in_transaction_session_timeout = '1min'
>>>     idle_session_timeout = '5min'
>>
>> Did you actually run them together on one line like that?
>
> I think that is an email client artifact, dealing with RTF or something.
>
> I do agree too much is missing here. The only known errors involve SET,
> which configuration file settings don't use. And those errors are indeed
> from the lack of quoting.
>
> The lack of space I mentioned earlier, in the config text literal, is fine.
>
> David J.
Re: PG-15.6: timeout parameters erroring out
Thanks for clarifying, got it.
Is this issue specifically reported in only the PG 15 version? More syntax sensitivity?

On Mon, Feb 10, 2025 at 6:09 PM rob stone wrote:
> On Mon, 2025-02-10 at 16:06 +0530, Mukesh Tanuku wrote:
>> Thanks for your quick responses.
>>
>> I have just added the session related timeout parameters in the
>> postgresql.conf file also tried tweaking with different
>> values/patterns like ('1min', '1 min', 8, 10s, '10s', '10 s')
>>
>> but i get an immediate error after reloading the configuration file.
>>
>> postgres.log file
>> 2025-02-10 10:27:10.748 GMT [934108] LOG: received SIGHUP, reloading configuration files
>> 2025-02-10 10:27:10.749 GMT [934108] LOG: parameter "idle_in_transaction_session_timeout" changed to "8"
>> 2025-02-10 10:27:16.117 GMT [2531150] ERROR: trailing junk after numeric literal at or near "80s" at character 43
>> 2025-02-10 10:27:16.117 GMT [2531150] STATEMENT: SET idle_in_transaction_session_timeout = 80s
>>
>> though we are not firing any SQL statements explicitly we see this error.
>>
>> i guess something else is the issue other than spacing.
>
> I'd say that 934108 is the config file reload and that 2531150 is from
> somebody running psql from a shell or SQL code coming out from an
> application.
Re: PG-15.6: timeout parameters erroring out
Thanks to all of you for responding and clarifying things. Really appreciate it.

We finally confirmed that it is a SET command issue: the AWX application was firing the query without quotes, and it's a known bug on their end.

Thank you once again.

Regards
Mukesh Tanuku

On Mon, Feb 10, 2025 at 9:30 PM Tom Lane wrote:
> Mukesh Tanuku writes:
>> Is this issue specifically reported in only the PG 15 version? more syntax
>> sensitivity.
>
> PG 15 and up report the syntax error differently.
>
>     regression=# set foo = 15min;
>     ERROR: trailing junk after numeric literal at or near "15min"
>     LINE 1: set foo = 15min;
>                       ^
>
> In v14 that'd look like
>
>     regression=# set foo = 15min;
>     ERROR: syntax error at or near "min"
>     LINE 1: set foo = 15min;
>                         ^
>
> There are variants of this (lack of space between a number and an
> identifier) that will be rejected by 15+ although older versions
> accepted them. But in a SET command I think no version will take it.
>
> regards, tom lane
Re: PG-15.6: timeout parameters erroring out
I see there is an open issue in AWX (https://github.com/ansible/awx/issues/15406) which is related to this. AWX is connected to the DB in my case.

On Mon, Feb 10, 2025 at 7:04 PM Mukesh Tanuku wrote:
> Thanks for clarifying, got it.
> Is this issue specifically reported in only the PG 15 version? more syntax
> sensitivity.
pgpool Connection issue: ERROR: unable to read message kind
Hello Everyone,

We have a Postgres setup in Azure (IaaS) with 2 DB nodes in the cluster, and pgpool running on each DB node. We have enabled the watchdog.

What we are doing: we are testing connectivity to the DB cluster via HammerDB to apply load. HammerDB connects to the delegate IP via the port.

Issue: when we initiate the connection, it gives an error, but when we re-initiate the connection it succeeds. What could be the issue?

Error message we see when we initiate the connection for the first time:

    Error in Virtual User 1: Error: Connection to database failed
    connection to server at "10.35.8.4", port failed:
    ERROR: unable to read message kind
    DETAIL: kind does not match between main(53) slot[1] (45)

If you want to see any of the configuration settings I made, I can share them. Please help me in this regard.

Thank you
Mukesh Tanuku
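"kind does not match between main(X) slot[Y] (Z)" means pgpool received different protocol message kinds from the two backends for the same request, i.e. the backends disagreed about session state. As a first, hedged diagnostic step (not a confirmed fix), it is worth comparing what pgpool believes about each node right after the failing first attempt:

    -- run through pgpool from any psql session; SHOW pool_nodes is a
    -- pgpool-specific command reporting each backend's status and role
    SHOW pool_nodes;

If one node shows up as down or in the wrong role on the first attempt, that would line up with the error clearing on the second connection.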