Re: [pgpool-general: 9106] Postgres/pgpool HA failover process

2024-05-30 Thread Mukesh Tanuku
BTW, here is the network configuration we set:

AUTOCONNECT_PRIORITY=120
BOOTPROTO=dhcp
DEVICE=eth0
DHCPV6C=yes
HWADDR=
IPV6INIT=yes
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
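
For the earlier "No route to host" symptom when reaching PCP via the VIP from a
remote host, a few hedged checks that may help narrow it down (the interface
name enp0s8 and the <delegate_ip> placeholder are assumptions; substitute the
values from your own pgpool.conf):

  # on the leader node: is the VIP actually bound to the expected interface?
  ip addr show dev enp0s8
  # on the remote host: is there a route to the VIP, and does it answer?
  ip route get <delegate_ip>
  ping -c 3 <delegate_ip>
  # on the leader node, if firewalld is in use: is the PCP port (default 9898) open?
  sudo firewall-cmd --list-ports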

Regards
Mukesh Tanuku

On Wed, May 29, 2024 at 12:53 PM Mukesh Tanuku 
wrote:

> Thanks Bo for your response to my email.
>
> Below are the parameters:
>
>
>
> if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev enp0s8 label enp0s8:0'
> if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev enp0s8'
> arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I enp0s8'
>
> The above commands work on the leader node, and once the VIP is up
> successfully, PCP commands via the VIP host work only from the leader node
> locally. If we try to execute the PCP commands via the VIP from a remote
> host, it throws an error:
> ERROR: connection to host "" failed with error "No route to host"
>
> What's the possible cause in this case?
>
> Regards
> Mukesh Tanuku
>
>
> On Tue, May 28, 2024 at 5:30 AM Bo Peng  wrote:
>
>> Hi,
>>
>> > Hello Everyone,
>> >
>> > We are working on to setup a High available solution for postgres and
>> > running some test cases
>> > based on that we need some clarification on the HA setup as per the same
>> > configuration given in this example
>> > <https://www.pgpool.net/docs/42/en/html/example-cluster.html>
>>
>> If you want to configure a new cluster,
>> we recommend using the latest version.
>>
>> https://www.pgpool.net/docs/latest/en/html/example-cluster.html
>>
>> > Question - 01
>> > Regarding the active client connections that connect via the delegate
>> > VIP, what happens to those connections during the failover process? If
>> > the primary DB goes down (gracefully or via a crash), do the active
>> > connections hang until the new primary DB is available, or do those
>> > active connections get terminated?
>>
>> The existing connections will be disconnected if the primary DB is down.
>> However, the subsequent connections will be routed to the new primary.
>>
>> > Is the same true for the leader pgpool service as well?
>>
>> Yes.
>>
>> > Questions - 02
>> > Configuring the delegate IP:
>> > a. Are there any standards/limitations for choosing the unused IP as the
>> > delegate IP? For example, must the IP be within the same subnet range as
>> > all the nodes in the cluster?
>>
>> There are no limitations in Pgpool-II; it depends on the OS.
>> Pgpool-II uses the "ip addr ..." command to assign or release the delegate IP.
>>
>> Below are the parameters:
>>
>>   if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev enp0s8 label enp0s8:0'
>>   if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev enp0s8'
>>   arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I enp0s8'
>>
>> You need to make sure the commands work.
>>
>> > b. Are there any challenges for a non-root user to work with the VIP?
>> > We run pgpool as a non-root user.
>>
>> You need to make sure the non-root user can execute the ip and arping
>> commands with sudo, without a password.
>>
>> --
>> Bo Peng 
>> SRA OSS LLC
>> TEL: 03-5979-2701 FAX: 03-5979-2702
>> URL: https://www.sraoss.co.jp/
>>
>
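
A hedged sketch of the sudo-without-password requirement mentioned above, as a
sudoers drop-in (the user name "postgres" and the binary paths are assumptions;
adjust them to your environment and always edit via visudo):

  # /etc/sudoers.d/pgpool  (sketch; edit with "visudo -f /etc/sudoers.d/pgpool")
  postgres ALL=(root) NOPASSWD: /sbin/ip, /usr/sbin/arping

With something like this in place, running "sudo -n /sbin/ip addr show" as that
user should succeed without prompting, which is an easy check before starting
pgpool.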


Pgpool delegate IP is not reachable from the remote host

2024-06-18 Thread Mukesh Tanuku
Hello everyone.

We have a pgpool & postgres setup made as per the given example
<https://www.pgpool.net/docs/latest/en/html/example-cluster.html>

We haven't configured if_up_cmd, if_down_cmd & arping_cmd in pgpool.conf
(we commented them out on purpose). We enabled delegate_ip and assigned
an unused IP.

While bringing up pgpool, the log shows the VIP is successfully brought up via
if_up_cmd. How is this happening? Is there a default if_up_cmd that is used to
bring up the VIP?



2024-06-18 08:27:02.735: watchdog_utility pid 102493: LOG:  successfully acquired the delegate IP:"10.127.1.20"
2024-06-18 08:27:02.735: watchdog_utility pid 102493: DETAIL:  'if_up_cmd' returned with success
2024-06-18 08:27:02.735: watchdog pid 102378: LOG:  watchdog escalation process with pid: 102493 exit with SUCCESS.

NOTE: We ran into a situation where we disabled the if_* commands in the
configuration and started pgpool, but the VIP still gets acquired by the
leader pgpool node even though the if_* commands are not set in pgpool.conf.

Can anyone explain how the VIP works in the pgpool watchdog?
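
One hedged explanation worth checking against the documentation for your exact
version: commenting a parameter out does not disable it, it makes Pgpool-II
fall back to its built-in default, and the shipped defaults for the if_*
commands look roughly like this (the interface name eth0 is the sample
default, not necessarily yours):

  if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'
  if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev eth0'
  arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'

So as long as delegate_ip is set, the leader node will still try to bring the
VIP up with default commands along these lines.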

Thank you
Regards
Mukesh Tanuku


Query 2 Node HA test case result

2024-07-03 Thread Mukesh Tanuku
Hello everyone,
We are doing a POC of a postgres HA setup with streaming replication (async),
using Pgpool-II for load balancing & connection pooling and repmgr for
setting up HA & automatic failover.
We are applying a test case: isolating the VM1 node from the network
completely for more than 2 minutes and then plugging the network back in,
since we want to verify how the system behaves during network glitches and
whether there is any chance of split-brain.
Our current setup looks like below,
2 VM's on Azure cloud, each VM has Postgres running along with Pgpool
service.

We enabled watchdog and assigned a delegate IP.
NOTE: due to some limitations we are using a floating IP as the delegate IP.

During the test, here are our observations:
1. Client connections hung from the time VM1 was lost from the network until
VM1 came back.
2. Once VM1 was lost, Pgpool promoted VM2 to LEADER node and the Postgres
standby on VM2 was promoted to primary as well, but client connections still
did not reach the new primary. Why are the connections not reaching the new
primary?
3. Once VM1 was back on the network, there was a split-brain situation: pgpool
on VM1 took the lead and became the LEADER node (per pgpool.log), and from
then on clients connected to the VM1 node via the VIP.

pgpool.conf

  sr_check_period = 10              # seconds
  health_check_period = 30          # seconds
  health_check_timeout = 20         # seconds
  health_check_max_retries = 3
  health_check_retry_delay = 1
  wd_lifecheck_method = 'heartbeat'
  wd_interval = 10
  wd_heartbeat_keepalive = 2
  wd_heartbeat_deadtime = 30
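
For a 2-node watchdog cluster, split-brain behaviour depends heavily on the
quorum/consensus settings. A hedged sketch of the parameters usually discussed
for this scenario (values are illustrative, not a recommendation; check the
defaults and semantics for your Pgpool-II version):

  # pgpool.conf (sketch; 2-node watchdog considerations)
  enable_consensus_with_half_votes = on   # lets a 2-node cluster treat 1 of 2 votes as quorum
  failover_when_quorum_exists = on        # only perform failover while quorum is held
  failover_require_consensus = on         # require agreement from the watchdog cluster

Note that with only two nodes, each half can consider itself quorate when half
votes are allowed, so adding a third (witness) pgpool node or an external
tie-breaker is often recommended as the more robust fix.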


Logs information:

From VM2:

Pgpool.log

14:30:17  Network disconnected

About 10 seconds later, the streaming replication check failed with a timeout.

2024-07-03 14:30:26.176: sr_check_worker pid 58187: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out



Then the pgpool health check failed, timing out as per health_check_timeout =
20 seconds.

2024-07-03 14:30:35.869: health_check0 pid 58188: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out



health_check and sr_check were retried but timed out again.



2024-07-03 14:30:46.187: sr_check_worker pid 58187: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out

2024-07-03 14:30:46.880: health_check0 pid 58188: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out



Watchdog received a message saying the Leader node is lost.



2024-07-03 14:30:47.192: watchdog pid 58151: WARNING:  we have not received
a beacon message from leader node "staging-ha0001: Linux staging-ha0001"

2024-07-03 14:30:47.192: watchdog pid 58151: DETAIL:  requesting info
message from leader node

2024-07-03 14:30:54.312: watchdog pid 58151: LOG:  read from socket failed,
remote end closed the connection

2024-07-03 14:30:54.312: watchdog pid 58151: LOG:  client socket of
staging-ha0001: Linux staging-ha0001 is closed

2024-07-03 14:30:54.313: watchdog pid 58151: LOG:  remote node
"staging-ha0001: Linux staging-ha0001" is reporting that it has lost us

2024-07-03 14:30:54.313: watchdog pid 58151: LOG:  we are lost on the
leader node "staging-ha0001: Linux staging-ha0001"



health_check and sr_check were retried but timed out again.



2024-07-03 14:30:57.888: health_check0 pid 58188: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out

2024-07-03 14:30:57.888: health_check0 pid 58188: LOG:  health check
retrying on DB node: 0 (round:3)

2024-07-03 14:31:06.201: sr_check_worker pid 58187: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out





About 10 seconds after the leader node was lost, the watchdog changed the
current node's state to LEADER.

2024-07-03 14:31:04.199: watchdog pid 58151: LOG:  watchdog node state
changed from [STANDING FOR LEADER] to [LEADER]





The health check failed on node 0, a degenerate request was received for
node 0, and the pgpool main process started quarantining
staging-ha0001(5432) (shutting down).



2024-07-03 14:31:08.202: watchdog pid 58151: LOG:  setting the local node
"staging-ha0002: Linux staging-ha0002" as watchdog cluster leader

2024-07-03 14:31:08.202: watchdog pid 58151: LOG:
signal_user1_to_parent_with_reason(1)

2024-07-03 14:31:08.202: watchdog pid 58151: LOG:  I am the cluster leader
node but we do not have enough nodes in cluster

2024-07-03 14:31:08.202: watchdog pid 58151: DETAIL:  waiting for the
quorum to start escalation process

2024-07-03 14:31:08.202: main pid 58147: LOG:  Pgpool-II parent process
received SIGUSR1

2024-07-03 14:31:08.202: main pid 58147: LOG:  Pgpool-II parent process
received watchdog state change signal from watchdog

2024-07-03 14:31:08.899: health_check0 pid 58188: LOG:  failed to connect
to PostgreSQL server on "staging-ha0001:5432", timed out

2024-07-03 14:31:08.899: health_check0 pid 58188: LOG:  health check failed
on node 0 (timeout:0)

2024-07-03 14:31:08.899: health_check0 

Replication lag in Postgres

2024-07-12 Thread Mukesh Tanuku
Hello everyone.
Firstly, thanks to the community members who are addressing all the queries
posted here. They give us more insight into issues and doubts around
postgres.

I have a question about a postgres HA setup.
We are setting up a 2-node postgres cluster with async streaming
replication, and we want to define an RPO (Recovery Point Objective) in case
of primary failure.

How can we best define the RPO in this setup? Since it's an async streaming
replication setup, there might be a chance of data loss, which is
proportional to the replication delay.

Is there any way to configure the delay duration, for example to make sure
the standby syncs with the primary every 10 minutes?

Thank you
Regards
Mukesh T


Re: Replication lag in Postgres

2024-07-12 Thread Mukesh Tanuku
Thank you for the information Laurenz Albe

On Fri, Jul 12, 2024 at 9:13 PM Laurenz Albe 
wrote:

> On Fri, 2024-07-12 at 20:41 +0530, Mukesh Tanuku wrote:
> > I have a question with postgres HA setup.
> > We are setting up a 2 node postgres cluster with async streaming
> replication, we want to
> > define a RPO (Recovery point objective) in case of primary failure.
> >
> > How can we best define the RPO in this setup? since it's an async
> streaming replication
> > setup there might be a chance of data loss which is proportional to the
> replication delay.
> >
> > Is there any way we can configure the delay duration, like for example
> to make sure every
> > 10 mins the standby sync has to happen with primary?
>
> When there is a delay, it is usually because replay at the standby is
> delayed.
> The WAL information is still replicated.  You won't lose that information
> on
> failover; it will just make the failover take longer.
>
> Unless you have a network problem, you should never lose more than a
> fraction
> of a second.
>
> Yours,
> Laurenz Albe
>
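
To put a number on the lag Laurenz describes, a hedged monitoring sketch (run
on the primary; the columns shown exist in pg_stat_replication on PostgreSQL
10 and later):

  -- WAL lag in bytes plus the time-based lag columns, per connected standby
  SELECT application_name,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
         write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;

Trending these values over time gives a concrete basis for stating an RPO for
an async setup.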


repmgr 5.4.1 dependend package on RHEL9

2024-10-13 Thread Mukesh Tanuku
Hello everyone,

Question: We are installing repmgr 5.4.1 along with postgres 15.6 on RHEL 9
and are facing an issue with the dependent development package
json-c-devel-0.14-11.el9.x86_64. This package is not available in the
RHEL repositories, so we have to enable the EPEL repo (EPEL is a
community-supported repository hosted by the Fedora community) to install
the development package on RHEL 9. There are some additional risks in going
this route. So what is the least risky way to use repmgr on RHEL 9?
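
One hedged alternative to pulling json-c-devel from EPEL is checking whether
it is shipped in Red Hat's CodeReady Linux Builder (CRB) repository, which is
vendor-hosted; whether the package is actually there should be verified on
your subscription. The repo names below are the commonly documented ones:

  # RHEL 9 with subscription-manager
  sudo subscription-manager repos --enable codeready-builder-for-rhel-9-$(arch)-rpms
  # Rocky/Alma-style rebuilds use the "crb" repo id instead
  sudo dnf config-manager --set-enabled crb
  sudo dnf install json-c-devel

If the package really is only in EPEL, limiting that repo to the single
package (e.g. with includepkgs in the repo file) keeps the added surface
small.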

Thanks to the community for active & prompt responses.

Regards
Mukesh T


PG-15.6: timeout parameters erroring out

2025-02-09 Thread Mukesh Tanuku
Hello team,

We enabled the postgres timeout parameters in the postgresql.conf file:


idle_in_transaction_session_timeout = '1min'
idle_session_timeout = '5min'

We also tried it the other way, like below:

idle_in_transaction_session_timeout = 6
idle_session_timeout = 30

We see these errors reported in the logs after we enabled those parameters.
log:
2025-02-10 04:17:19.156 GMT [2467573] ERROR:  trailing junk after numeric
literal at or near "1m" at character 43
2025-02-10 04:17:19.156 GMT [2467573] STATEMENT:  SET
idle_in_transaction_session_timeout = 1min
2025-02-10 04:17:19.845 GMT [2467575] ERROR:  trailing junk after numeric
literal at or near "1m" at character 43
2025-02-10 04:17:19.845 GMT [2467575] STATEMENT:  SET
idle_in_transaction_session_timeout = 1min
...
...
...

2025-02-10 04:22:00.823 GMT [2468337] ERROR:  trailing junk after numeric
literal at or near "5m" at character 28
2025-02-10 04:22:00.823 GMT [2468337] STATEMENT:  SET idle_session_timeout
= 5min
2025-02-10 04:22:03.487 GMT [2468342] ERROR:  trailing junk after numeric
literal at or near "5m" at character 28
2025-02-10 04:22:03.487 GMT [2468342] STATEMENT:  SET idle_session_timeout
= 5min
2025-02-10 04:22:04.470 GMT [2468345] ERROR:  trailing junk after numeric
literal at or near "5m" at character 28

Is there any known issue/bug with these parameters? Can someone help me get
this working, since our app's idle connections are piling up and we are
reaching max_connections.

I really appreciate your help. Thank you.

Regards
Mukesh Tanuku


Re: PG-15.6: timeout parameters erroring out

2025-02-10 Thread Mukesh Tanuku
Thanks for your quick responses.

I have just added the session-related timeout parameters to the
postgresql.conf file and also tried tweaking them with different
values/patterns like ('1min', '1 min', 8, 10s, '10s', '10 s'),

but I get an immediate error after reloading the configuration file.

*postgres.log file*
2025-02-10 10:27:10.748 GMT [934108] LOG:  received SIGHUP, reloading
configuration files
2025-02-10 10:27:10.749 GMT [934108] LOG:  parameter
"idle_in_transaction_session_timeout" changed to "8"
2025-02-10 10:27:16.117 GMT [2531150] ERROR:  trailing junk after numeric
literal at or near "80s" at character 43
2025-02-10 10:27:16.117 GMT [2531150] STATEMENT:  SET
idle_in_transaction_session_timeout = 80s

Though we are not firing any SQL statements explicitly, we see this error.

I guess something other than spacing is the issue.

Regards
Mukesh Tanuku


On Mon, Feb 10, 2025 at 10:38 AM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Sunday, February 9, 2025, Tom Lane  wrote:
>
>> Mukesh Tanuku  writes:
>> > We enabled the postgres timeout parameters in the postgresql.conf file
>>
>> > idle_in_transaction_session_timeout = '1min'
>> > idle_session_timeout = '5min'
>>
> Did you actually run them together on one line like that?
>
>
>>
> I think that is an email client artifact, dealing with RTF or something.
>
> I do agree too much is missing here.  The only known errors involve SET
> which configuration file settings don’t use.  And those errors are indeed
> the lack of quoting.
>
> The lack of space I mentioned earlier, in the config text literal, is fine.
>
> David J.
>
>


Re: PG-15.6: timeout parameters erroring out

2025-02-10 Thread Mukesh Tanuku
Thanks for clarifying, got it.
Is this issue reported only in the PG 15 version, i.e. more syntax
sensitivity?

On Mon, Feb 10, 2025 at 6:09 PM rob stone  wrote:

>
>
> On Mon, 2025-02-10 at 16:06 +0530, Mukesh Tanuku wrote:
> > Thanks for your quick responses.
> >
> > I have just added the session related timeout parameters in the
> > postgresql.conf file also tried tweaking with different
> > values/patterns like ('1min', '1 min', 8, 10s, '10s', '10 s')
> >
> > but i get an immediate error after reloading the configuration file.
> >
> > postgres.log file
> > 2025-02-10 10:27:10.748 GMT [934108] LOG:  received SIGHUP, reloading
> > configuration files
> > 2025-02-10 10:27:10.749 GMT [934108] LOG:  parameter
> > "idle_in_transaction_session_timeout" changed to "8"
> > 2025-02-10 10:27:16.117 GMT [2531150] ERROR:  trailing junk after
> > numeric literal at or near "80s" at character 43
> > 2025-02-10 10:27:16.117 GMT [2531150] STATEMENT:  SET
> > idle_in_transaction_session_timeout = 80s
> >
> > though we are not firing any SQL statements explicitly we see this
> > error.
> >
> > i guess something else is the issue other than spacing.
> >
> >
>
> I'd say that 934108 is the config file reload and that 2531150 is from
> somebody running psql from a shell or SQL code coming out from an
> application.
>
>
>
> >
>


Re: PG-15.6: timeout parameters erroring out

2025-02-10 Thread Mukesh Tanuku
Thanks to all of you for responding and clarifying things. Really
appreciate it.

We finally confirmed that it is a SET command issue: the AWX application is
firing the query without quotes, and it's a known bug on their end.

Thank you once again.
Regards
Mukesh Tanuku

On Mon, Feb 10, 2025 at 9:30 PM Tom Lane  wrote:

> Mukesh Tanuku  writes:
> > Is this issue specifically reported in only the PG 15 version? more
> syntax
> > sensitivity.
>
> PG 15 and up report the syntax error differently.
>
> regression=# set foo = 15min;
> ERROR:  trailing junk after numeric literal at or near "15min"
> LINE 1: set foo = 15min;
>   ^
>
> In v14 that'd look like
>
> regression=# set foo = 15min;
> ERROR:  syntax error at or near "min"
> LINE 1: set foo = 15min;
> ^
>
> There are variants of this (lack of space between a number and an
> identifier) that will be rejected by 15+ although older versions
> accepted them.  But in a SET command I think no version will take it.
>
> regards, tom lane
>
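
For reference, a minimal psql illustration of the quoting point above
(hypothetical session; the parameter takes either a quoted string with a unit
or a bare number of milliseconds):

  -- rejected on PG 15+ as "trailing junk after numeric literal"
  SET idle_in_transaction_session_timeout = 1min;
  -- both of these are accepted
  SET idle_in_transaction_session_timeout = '1min';
  SET idle_in_transaction_session_timeout = 60000;   -- milliseconds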


Re: PG-15.6: timeout parameters erroring out

2025-02-10 Thread Mukesh Tanuku
I see there is an open issue in AWX (
https://github.com/ansible/awx/issues/15406) which is related to this. AWX
is connected to the DB in my case.

On Mon, Feb 10, 2025 at 7:04 PM Mukesh Tanuku 
wrote:

> Thanks for clarifying, got it.
> Is this issue specifically reported in only the PG 15 version? more syntax
> sensitivity.
>
> On Mon, Feb 10, 2025 at 6:09 PM rob stone  wrote:
>
>>
>>
>> On Mon, 2025-02-10 at 16:06 +0530, Mukesh Tanuku wrote:
>> > Thanks for your quick responses.
>> >
>> > I have just added the session related timeout parameters in the
>> > postgresql.conf file also tried tweaking with different
>> > values/patterns like ('1min', '1 min', 8, 10s, '10s', '10 s')
>> >
>> > but i get an immediate error after reloading the configuration file.
>> >
>> > postgres.log file
>> > 2025-02-10 10:27:10.748 GMT [934108] LOG:  received SIGHUP, reloading
>> > configuration files
>> > 2025-02-10 10:27:10.749 GMT [934108] LOG:  parameter
>> > "idle_in_transaction_session_timeout" changed to "8"
>> > 2025-02-10 10:27:16.117 GMT [2531150] ERROR:  trailing junk after
>> > numeric literal at or near "80s" at character 43
>> > 2025-02-10 10:27:16.117 GMT [2531150] STATEMENT:  SET
>> > idle_in_transaction_session_timeout = 80s
>> >
>> > though we are not firing any SQL statements explicitly we see this
>> > error.
>> >
>> > i guess something else is the issue other than spacing.
>> >
>> >
>>
>> I'd say that 934108 is the config file reload and that 2531150 is from
>> somebody running psql from a shell or SQL code coming out from an
>> application.
>>
>>
>>
>> >
>>
>


pgpool Connection issue: ERROR: unable to read message kind

2024-12-20 Thread Mukesh Tanuku
Hello Everyone,

We have a Postgres setup in Azure (IaaS) with 2 DB nodes in the cluster, and
pgpool is running on each DB node. We have enabled the watchdog.

What we are doing: We are testing connectivity to the DB cluster via
HammerDB to apply load.
HammerDB is connecting to the delegate IP via  port.

Issue: when we initiate the connection it gives an error, but when we
re-initiate the connection it succeeds. What could be the issue?

The error message we see when we initiate the connection for the first
time:

Error in Virtual User 1: Error: Connection to database failed
connection to server at "10.35.8.4", port  failed: ERROR:  unable to
read message kind
DETAIL:  kind does not match between main(53) slot[1] (45)
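
A hedged first diagnostic step for the "kind does not match" error is
comparing what each pgpool believes the backend states are, since the error
typically means the backends returned different responses to the same
request. The host, port, and user below are the usual PCP defaults and are
assumptions for this environment:

  -- via psql, connected through pgpool
  SHOW pool_nodes;

  # from the shell on each pgpool node
  pcp_watchdog_info -h localhost -p 9898 -U pgpool
  pcp_node_info -h localhost -p 9898 -U pgpool -n 0
  pcp_node_info -h localhost -p 9898 -U pgpool -n 1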

If you want to see any of the configuration settings I made, I can share
them.

Please help me in this regard.

Thank you
Mukesh Tanuku