Re: Postgresql database terminates abruptly with too many open files error

2025-01-15 Thread Sri Mrudula Attili

Hello Tom,


The max_connections =200 and max_files_per_process =1000 as you mentioned.

So should a max_files_per_process =500 helps?


I could see from the number of connections to the databaseis not 
exceeding 20. But still these 20 are causing all the damage or using up 
all the openfiles quota.



Thanks,

Sri

On 14/01/2025 14:19, Tom Lane wrote:

Sri Mrudula Attili  writes:

   We have a postgresql VDB(virtual database- Delphix) that keeps
terminating due "to too many open files".

What do you have max_connections set to, and how many actually-live
server processes are there typically?


The number of allowed openfiles at OS level are 65000.

I'm suspecting that you either need to increase that, decrease
max_files_per_process (which defaults to 1000), or decrease
max_connections.

regards, tom lane





Re: Data Out of Sync with Physical Streaming Replication

2025-01-15 Thread Tomas Vondra
On 1/15/25 19:50, Tim Gerber wrote:
> Hi All,
>
>...
>
> Archive mode is on.   I ran pg_amcheck and everything came back clean. I
> know synchronous replication is an option, but it feels like something
> else is going on and would like to get to the bottom of it.
> 
> Any ideas on what could be causing this?  I’m planning on turning on
> data_checksums and running pg_checksums to see if any corruption is found.
> 

No idea. If the storage has some issues (which is the only thing data
checksums could catch, in some cases), then anything is possible.

I think it'd be interesting to compare the tuples returned on primary
vs. standby, including system columns etc. Ideally using pageinspect.
That might tell you what exactly is the difference, and perhaps also say
which XID to look at.

regards

-- 
Tomas Vondra





Re: Intermittent errors when fetching cursor rows on PostgreSQL 16

2025-01-15 Thread Enrico Schenone

Il 15/01/25 17:03, Adrian Klaver ha scritto:

On 1/14/25 23:46, Enrico Schenone wrote:


Which environment did you run the recent prolonged test against?
The first one: Distinct application server and DB server on distinct 
subnets.




Is that the same as where you saw the errors?

Yes.



*Enrico Schenone*
Software Architect








Re: Intermittent errors when fetching cursor rows on PostgreSQL 16

2025-01-15 Thread Adrian Klaver

On 1/14/25 23:46, Enrico Schenone wrote:


Which environment did you run the recent prolonged test against?
The first one: Distinct application server and DB server on distinct 
subnets.




Is that the same as where you saw the errors?


*Enrico Schenone*
Software Architect


--
Adrian Klaver
adrian.kla...@aklaver.com





Data Out of Sync with Physical Streaming Replication

2025-01-15 Thread Tim Gerber
Hi All,

We have run into an issue where the data for a primary and standby Postgres
instances are out of sync. We are using streaming physical replication.  We
have encountered this issue several times over the last year on different
Postgres Clusters.  Postgres indicates everything is in sync but we can
clearly see that is not the case.



The essentials:



PG Verion: 14.7

OS: Rocky Linux 8.10  -> Linux 4.18.0-553.30.1.el8_10.x86_64

Server: VMWare VM

Storage:  Enterprise level SSD



We are starting to push Postgres 17 but it will be a while before we
upgrade our existing PG 14 DB’s, therefore, I need to figure out why this
occurs in PG 14. I plan on at a minimum, upgrading to PG 14.15.



I have a cluster setup to show this. We are using the default asynchronous
replication.   The primary and standby are in the same data center.   I
haven’t been able to reproduce this on demand.  Initially, I thought it
might be a storage issue, but we are not seeing any errors in dmesg, kernel
log, journalctl, or on the storage controller.



I’m hoping someone has seen something similar and has some guidance on
further troubleshooting this.





On the PRIMARY, we can see the replication slot is running:



--  DB is in read-write mode:



postgres=# select pg_is_in_recovery();

 pg_is_in_recovery

---

 f





--  We have single replication slot that shows the lsn written, sent,
flushed, and replayed is all the same:

postgres=# select * from pg_stat_replication;

-[ RECORD 1 ]+--

pid  | 2457681

usesysid | 16384

usename  | xx

application_name | 

client_addr  | yyy.yyy.yyy.yyy

client_hostname  |

client_port  | 58522

backend_start| 2025-01-15 11:03:26.99902-06

backend_xmin |

state| streaming

sent_lsn | 0/1E0001B8

write_lsn| 0/1E0001B8

flush_lsn| 0/1E0001B8

replay_lsn   | 0/1E0001B8

write_lag|

flush_lag|

replay_lag   |

sync_priority| 0

sync_state   | async

reply_time   | 2025-01-15 11:04:27.149521-06











--  Primary shows no database conflicts:



postgres=# select * from  pg_stat_database_conflicts;

-[ RECORD 1 ]+--

datid| 1

datname  | template1

confl_tablespace | 0

confl_lock   | 0

confl_snapshot   | 0

confl_bufferpin  | 0

confl_deadlock   | 0

-[ RECORD 2 ]+--

datid| 13747

datname  | template0

confl_tablespace | 0

confl_lock   | 0

confl_snapshot   | 0

confl_bufferpin  | 0

confl_deadlock   | 0

-[ RECORD 3 ]+--

datid| 16390

datname  | issuedb

confl_tablespace | 0

confl_lock   | 0

confl_snapshot   | 0

confl_bufferpin  | 0

confl_deadlock   | 0

-[ RECORD 4 ]+--

datid| 13748

datname  | postgres

confl_tablespace | 0

confl_lock   | 0

confl_snapshot   | 0

confl_bufferpin  | 0

confl_deadlock   | 0















On the STANDBY server, we can see that there is no lag:



--  Instance is in recovery mode:

postgres=# select pg_is_in_recovery();

 pg_is_in_recovery

---

 t





--  Received and replayed lsn are the same. Note, these also match the lsn
from the primary.



postgres=# select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();

-[ RECORD 1 ]---+---

pg_last_wal_receive_lsn | 0/1E0001B8

pg_last_wal_replay_lsn  | 0/1E0001B8









--   no conflicts:



postgres=# select * from  pg_stat_database_conflicts;

-[ RECORD 1 ]+--

datid| 1

datname  | template1

confl_tablespace | 0

confl_lock   | 0

confl_snapshot   | 0

confl_bufferpin  | 0

confl_deadlock   | 0

-[ RECORD 2 ]+--

datid| 13747

datname  | template0

confl_tablespace | 0

confl_lock   | 0

confl_snapshot   | 0

confl_bufferpin  | 0

confl_deadlock   | 0

-[ RECORD 3 ]+--

datid| 16390

datname  | issuedb

confl_tablespace | 0

confl_lock   | 0

confl_snapshot   | 0

confl_bufferpin  | 0

confl_deadlock   | 0

-[ RECORD 4 ]+--

datid| 13748

datname  | postgres

confl_tablespace | 0

confl_lock   | 0

confl_snapshot   | 0

confl_bufferpin  | 0

confl_deadlock   | 0











I've performed a switchover multiple times meaning the standby has become
the primary and the primary has become the standby.

Not all tables are out of sync.  In this case, it is three of them:





PRIMARY:



issuedb=# select count(1) from issue;

 count

---

 2

(1 row)





issuedb=# select count(1) from comment;

 count

---

 1

(1 row)



issuedb=# select count(1) from troubleshoot_issue;

 count

---

 2

(1 row)





STANDBY:



issuedb=# select count(1) from issue;

 count

---

 3

(1 row)





issuedb=# select count(1) from comment;

 count

---

 4

(1 row)



issuedb

Re: Data Out of Sync with Physical Streaming Replication

2025-01-15 Thread Tim Gerber
Thanks for the thoughts, Tomas.

I suspected storage only because data was missing.  Everything I've looked
at is not showing any indication of this.

Here's the tuple data for each.  Appears to look normal to me.


ISSUE table:

PRIMARY:

issuedb=#=#   SELECT ctid, xmin, xmax, issue_id as issue_id from issue;
 ctid  | xmin | xmax |   issue_id
---+--+--+--
 (0,1) | 3439 | 3443 | 0192d4c1-7220-7780-be76-e67f955a719a
 (0,2) | 3439 | 3439 | 0192d4c1-7233-725c-a97a-0acc82fa2c2a
 (0,5) | 3724 | 3724 | 01934472-b052-7ce2-9c7d-35a809bcb3f6
(3 rows)

issuedb=#=#  SELECT t_xmin, t_xmax, tuple_data_split('issue'::regclass,
t_data, t_infomask, t_infomask2, t_bits)
issuedb=#-#  FROM heap_page_items(get_raw_page('issue', 0));
 t_xmin | t_xmax |
tuple_data_split
++---
   3439 |   3443 |
{"\\x0192d4c172207780be76e67f955a719a","\\x0d3534333231","\\x1b3031322d31313131312d3031","\\x01",NULL,NULL,"\\x01",NULL,NULL,NULL,"\\xbc6d6ed88dc80200","\\x1d62756b6174796e6963686f6c61",NULL,NULL}
   3439 |   3439 |
{"\\x0192d4c17233725ca97a0acc82fa2c2a","\\x0f363534333231","\\x1b3031322d31313131312d3032","\\x01",NULL,NULL,"\\x00",NULL,NULL,NULL,"\\xb5986ed88dc80200","\\x1d62756b6174796e6963686f6c61",NULL,NULL}
   3698 |   3711 |
{"\\x019342bde8217af59ebbb1315860c475","\\x0531","\\x0532","\\x01",NULL,NULL,"\\x00","\\x01",NULL,NULL,"\\xf5f19a7a3bca0200","\\x1767696c6c696c616e646a",NULL,NULL}
   3717 |   3719 |
{"\\x0193446fa7de79c6a6c4bc508ab31f30","\\x15334231323334353637","\\x1b3031322d30303030302d3030","\\x01",NULL,NULL,"\\x00","\\x01",NULL,NULL,"\\x66eeef1842ca0200","\\x1767696c6c696c616e646a",NULL,NULL}
   3724 |   3724 |
{"\\x01934472b0527ce29c7d35a809bcb3f6","\\x133141313233343536","\\x1b3031322d30303030302d3030","\\x01",NULL,NULL,"\\x00","\\x01",NULL,NULL,"\\x67f1c82442ca0200","\\x1767696c6c696c616e646a",NULL,NULL}
(5 rows)

STANDBY:

 issuedb=# SELECT ctid, xmin, xmax, issue_id as issue_id from issue;
 ctid  | xmin | xmax |   issue_id
---+--+--+--
 (0,1) | 3439 | 3443 | 0192d4c1-7220-7780-be76-e67f955a719a
 (0,2) | 3439 | 3439 | 0192d4c1-7233-725c-a97a-0acc82fa2c2a
(2 rows)

issuedb=#  SELECT t_xmin, t_xmax, tuple_data_split('issue'::regclass,
t_data, t_infomask, t_infomask2, t_bits)
issuedb-#  FROM heap_page_items(get_raw_page('issue', 0));
 t_xmin | t_xmax |
   tuple_data_split
++
   3439 |   3443 |
{"\\x0192d4c172207780be76e67f955a719a","\\x0d3534333231","\\x1b3031322d31313131312d3031","\\x01",NULL,NULL,"\\x01",NULL,NULL,NULL,"\\xbc6d6ed88dc80200","\\x1d62756b6174796e6963686f6c61",NULL,NULL}
   3439 |   3439 |
{"\\x0192d4c17233725ca97a0acc82fa2c2a","\\x0f363534333231","\\x1b3031322d31313131312d3032","\\x01",NULL,NULL,"\\x00",NULL,NULL,NULL,"\\xb5986ed88dc80200","\\x1d62756b6174796e6963686f6c61",NULL,NULL}
(2 rows)




COMMENT table:

PRIMARY:

issuedb=#  select ctid, xmin, xmax, comment_id from comment;
 ctid  | xmin | xmax |  comment_id
---+--+--+--
 (0,1) | 3448 |0 | 0192d528-f0f8-7ab9-b709-035bf31b5a0c
 (0,2) | 3737 |0 | 019345d3-7451-7483-a8e4-ec60a4c12442
 (0,3) | 3738 |0 | 019345d4-2a00-7e18-874c-1c8b9e4f36f9
 (0,4) | 3739 |0 | 019345d5-2c5a-7aac-8d58-614272fe0e60
(4 rows)

issuedb=#   SELECT t_xmin, t_xmax, tuple_data_split('comment'::regclass,
t_data, t_infomask, t_infomask2, t_bits)
issuedb-#   FROM heap_page_items(get_raw_page('comment', 0));
 t_xmin | t_xmax |

 tuple_data_split

++-
---
   3448 |  0 |
{"\\x0192d528f0f87ab9b709035bf31b5a0c","\\x2554686973206973206120636f6d6d656e74","\\x4b30313932633535612d646533372d373064642d626263312d353337366232346130373036","\\x0b554e4954","\\x1d62756b6174796e6963686f6c61","
\\xc1acb56c8fc80200",NULL}
   3737 |  0 |
{"\\x019345d374517483a8e4ec60a4c12442","\\x395468697320697320612064696572656e7420636f6d6d656e74","\\x4b30313932633535612d646533372d373064642d626263312d353337366232346130373036","\\x0b554e4954","\\x1d62756b617
4796e6963686f6c61","\\x5a8ac68647ca0200",NULL}
   3738 |  0 |
{"\\x019345d42a007e18874c1c8b9e4f36f9","\\x3d54686973206973206120636f6d6d656e74206f6e20612072657061