Re: Upgrade Ubuntu 22 -> 24 may break PostgreSQL

2024-09-01 Thread Adrian Klaver

On 8/31/24 19:32, Justin Clift wrote:

On 2024-09-01 02:54, Peter J. Holzer wrote:

'Tis the season again.



As usual, a newer Ubuntu version comes with a newer Postgres version (16
instead of 14). Also as usual, I got a message during the upgrade that
Postgres 14 is obsolete, but the binaries have been left installed and
I should upgrade to Postgres 16 manually ASAP.


It'd *technically* be possible to automatically run an upgrade of the
PostgreSQL repository (via scripting?) at launch time, though just blindly
doing it for everyone would be a *major* change of behaviour.


The OP was using the Ubuntu repo, and for a given major version of Ubuntu 
that is pinned to a given major version of Postgres. I can't see changing 
that going over well. Now if a user is using the PGDG repos that is a 
different story: you point at the repos for the new Ubuntu version, do an 
update/upgrade and you are back on track.



Thoughts?


This is something the end user needs to work out ahead of time, as there 
is overhead in the process they need to take into consideration. 
Cranking up a new version of Ubuntu/Debian and having it take off doing 
things behind the scenes to the Postgres instance(s) would disturb me.




Regards and best wishes,

Justin Clift




--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Upgrade Ubuntu 22 -> 24 may break PostgreSQL

2024-09-01 Thread Adrian Klaver

On 8/31/24 11:02, Peter J. Holzer wrote:

On 2024-08-31 10:35:01 -0700, Adrian Klaver wrote:

On 8/31/24 09:54, Peter J. Holzer wrote:

'Tis the season again.

Ubuntu 24.04.1 has just been released, so many Ubuntu LTS users will now
be prompted to upgrade from 22.04 to 24.04.


Which I ignore.



But I think our personal preferences are beside the point. Many people
will upgrade Ubuntu in the next weeks or months. So I wanted to give
them a heads-up that this might not be as smooth as expected.


Major updates rarely are. That is why I tend to look before I leap. One 
way to make the leap easier is to switch the Postgres repo from the 
Ubuntu repo to the PGDG repo on the current Ubuntu version being run, 
then do the Ubuntu upgrade. The caveat is which version of Ubuntu/Postgres 
you are currently running and whether that Postgres version is still 
supported in the main PGDG repo. Otherwise you are going to need:


https://apt-archive.postgresql.org/




--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

2024-09-01 Thread Shaheed Haque
Since nobody more knowledgeable has replied...

I'm very interested in this area and still surprised that there is no
official/convenient/standard way to approach this (see
https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
).

Based partly on that thread, I ended up with a script that connects to both
ends of the replication, and basically loops while comparing the counts in
each table.
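
Roughly, the script just runs queries like the following on both sides and
diffs the output (a sketch with placeholder names; n_live_tup is only an
estimate, so an exact check needs count(*) per table):

-- Approximate per-table row counts from the statistics views (run on both
-- sides and compare).
SELECT schemaname, relname, n_live_tup
  FROM pg_stat_user_tables
 ORDER BY schemaname, relname;

-- Exact spot check for a single table (table name is a placeholder).
SELECT count(*) FROM my_schema.my_table;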

On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz,  wrote:

> I've got two Postgres 13 databases on AWS RDS.
>
>- One is a master, the other a slave using logical replication.
>- Replication has fallen behind by about 350Gb.
>- The slave was maxed out in terms of CPU for the past four days
>because of some jobs that were ongoing so I'm not sure what logical
>replication was able to replicate during that time.
>- I killed those jobs and now CPU on the master and slave are both low.
>- I look at the subscriber via `select * from pg_stat_subscription;`
>and see that latest_end_lsn is advancing albeit very slowly.
>- The publisher says write/flush/replay lags are all 13 minutes behind
>but it's been like that for most of the day.
>- I see no errors in the logs on either the publisher or subscriber
>outside of some simple SQL errors that users have been making.
>- CloudWatch reports low CPU utilization, low I/O, and low network.
>
>
>
> Is there anything I can do here? Previously I set wal_receiver_timeout
> to 0 because I had replication issues, and that helped things. I
> wish I had *some* visibility here to get any kind of confidence that it's
> going to pull through, but other than these lsn values and database logs,
> I'm not sure what to check.
>
>
>
> Sincerely,
>
> mj
>


Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

2024-09-01 Thread Muhammad Ikram
Hi Shaheed,

Maybe these considerations could help you or give a hint at the problem?


Check whether wal_receiver_timeout being set to 0 could be causing
problems, such as network issues not being detected quickly enough. Consider
re-evaluating this setting if you see connection issues.

If you notice that some data is missing on the subscriber, you could
increase max_slot_wal_keep_size on the publisher so that WAL is not removed
before it has been applied on the subscriber.

Do you have the flexibility to increase max_worker_processes,
max_logical_replication_workers, work_mem and maintenance_work_mem on the
subscriber (in case the bottleneck is on the subscriber)?

If there's significant lag, consider whether it might be more efficient to
drop the subscription and re-initialize it from scratch using a new base
backup, depending on the data volume and how long it might take for the
existing replication to catch up.
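
Also, for visibility, queries along these lines (just a sketch using the
standard monitoring views) on each side may help show how far behind the
apply is:

-- On the publisher: how much WAL the logical slot still has to hand over.
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      confirmed_flush_lsn)) AS remaining
  FROM pg_replication_slots
 WHERE slot_type = 'logical';

-- On the publisher: per-connection write/flush/replay lag.
SELECT application_name, state, sent_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;

-- On the subscriber: what the apply worker has received and reported back.
SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
  FROM pg_stat_subscription;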


 Regards,
Muhammad Ikram


On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque  wrote:

> Since nobody more knowledgeable has replied...
>
> I'm very interested in this area and still surprised that there is no
> official/convenient/standard way to approach this (see
> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
> ).
>
> Based partly on that thread, I ended up with a script that connects to
> both ends of the replication, and basically loops while comparing the
> counts in each table.
>
> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, 
> wrote:
>
>> I've got two Postgres 13 databases on AWS RDS.
>>
>>- One is a master, the other a slave using logical replication.
>>- Replication has fallen behind by about 350Gb.
>>- The slave was maxed out in terms of CPU for the past four days
>>because of some jobs that were ongoing so I'm not sure what logical
>>replication was able to replicate during that time.
>>- I killed those jobs and now CPU on the master and slave are both
>>low.
>>- I look at the subscriber via `select * from pg_stat_subscription;`
>>and see that latest_end_lsn is advancing albeit very slowly.
>>- The publisher says write/flush/replay lags are all 13 minutes
>>behind but it's been like that for most of the day.
>>- I see no errors in the logs on either the publisher or subscriber
>>outside of some simple SQL errors that users have been making.
>>- CloudWatch reports low CPU utilization, low I/O, and low network.
>>
>>
>>
>> Is there anything I can do here? Previously I set wal_receiver_timeout
>> to 0 because I had replication issues, and that helped things. I
>> wish I had *some* visibility here to get any kind of confidence that
>> it's going to pull through, but other than these lsn values and database
>> logs, I'm not sure what to check.
>>
>>
>>
>> Sincerely,
>>
>> mj
>>
>

-- 
Muhammad Ikram


Re: PG17 optimizations to vacuum

2024-09-01 Thread Peter Geoghegan
On Sun, Sep 1, 2024 at 5:44 PM Pavel Luzanov  wrote:
> I see a perfectly working TID-store optimization.
> With reduced maintenance_work_mem it used only one 'vacuuming indexes'
> phase instead of 21 in v16.
> But I also expected to see a reduction in the number of WAL records
> and the total size of the WAL. Instead, WAL numbers have significantly
> degraded.
>
> What am I doing wrong?

That does seem weird.

CC'ing the authors of the relevant VACUUM enhancements.

-- 
Peter Geoghegan




Re: Partitioning and unique key

2024-09-01 Thread veem v
On Sun, 1 Sept 2024 at 11:38, veem v  wrote:

>
> On Sun, 1 Sept 2024 at 10:03, veem v  wrote:
>
>>
>> On Sun, 1 Sept 2024 at 09:13, David G. Johnston <
>> david.g.johns...@gmail.com> wrote:
>>
>>> On Saturday, August 31, 2024, veem v  wrote:
>>>


 iii) And then alter the datatype of the partition key transaction_date
 to DATE in one shot at the table level (which should be fast as it has more
 granularity compared to the existing timestamp type, so it should be a
 catalog or dictionary change only), and that will remain part of the
 composite PK (transaction_id, transaction_date).

>>>
>>> While this might seem logical, in reality date and timestamptz are
>>> different fixed-width data types and thus any attempt to change from one to
>>> the other will involve a table rewrite.  Best you could do is leave the
>>> timestamptz in place and just truncate to day so the time is always
>>> midnight UTC.
>>>
>>>


>> Here, if we keep the PK column as is, i.e. transaction_timestamp as
>> timestamptz but truncated to remove the time component, then again in future
>> if someone tries to insert (using INSERT ON CONFLICT) data into the table
>> with a time component, it will get consumed and will not be restricted by
>> the PK constraint. So I was trying to make the data type of the
>> transaction_timestamp column DATE as well.
>>
>> As in this case we anyway have to create another column to populate the
>> date+timestamp values, since we can't throw those values away per business
>> need, we will be kind of rewriting the table. So is it okay if we:
>>
>> 1) Detach all the partitions.
>> 2) Do the alter using the "only" keyword at the table level (adding a new
>> column transaction_timestamp_new to hold the date+timestamp value and also
>> altering the existing transaction_timestamp column from timestamptz to
>> DATE).
>> 3) Then do the data fix (delete the duplicates) and alter the column, one
>> partition at a time for all of the partitions, and once done, attach those
>> partitions one by one.
>> 5) Rename the columns at the table level. Hope this won't need any table
>> rewrite.
>>
>> Is there any downside if we go by the above approach?
>>
>
> Or do you mean to say there is no way we can modify the data type of a
> partition key, even by detaching the partitions one by one? And thus the
> only way left is to create the table from scratch with partitions and
> populate the data into it? I was avoiding this because we also have many
> indexes on it, so creating it from scratch means creating those indexes
> again. So I wanted to achieve it by detaching the partitions, doing the
> required change and attaching them again.
>
>
I tried by detaching the partitions, but it still does not allow me
to alter the DATA TYPE of the partition key, throwing the error below.

ERROR:  cannot alter column "" because it is part of
the partition key of relation ""

Now I am wondering if it is really going to get too complex if we try to
stick with the partition detach-and-attach strategy. A few teammates suggest
adding a new column with just a DATE type, dropping the existing FK and PK
first, then detaching all the partitions, attaching the partitions back using
the new DATE column, and then recreating the PK. Btw, we have ~5 partitioned
tables with parent-child relationships on which this fix has to be applied.
So I'm still wondering about the best possible way to fix this issue.
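
For my own clarity, a rough sketch of what I understand that suggestion to
mean is below (all object names are made up and only one partition is shown).
Since the partition key of the existing parent cannot be changed, I assume
the partitions would end up attached under a new parent keyed on the DATE
column:

CREATE TABLE transactions_new (
    transaction_id        bigint      NOT NULL,
    transaction_timestamp timestamptz NOT NULL,
    transaction_date      date        NOT NULL,
    PRIMARY KEY (transaction_id, transaction_date)
) PARTITION BY RANGE (transaction_date);

-- For each existing partition: detach it, add and backfill the DATE column,
-- then attach it under the new parent with bounds on the new column.
-- NB: duplicates on (transaction_id, transaction_date) have to be cleaned up
-- first, and the FKs between the related tables dropped/recreated around
-- this, as discussed above.
ALTER TABLE transactions DETACH PARTITION transactions_2024_08;

ALTER TABLE transactions_2024_08 ADD COLUMN transaction_date date;
UPDATE transactions_2024_08
   SET transaction_date = transaction_timestamp::date;
ALTER TABLE transactions_2024_08
   ALTER COLUMN transaction_date SET NOT NULL;

ALTER TABLE transactions_new
      ATTACH PARTITION transactions_2024_08
      FOR VALUES FROM ('2024-08-01') TO ('2024-09-01');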


Could we go back in a replication slot?

2024-09-01 Thread xiong ding
Hi, community,

The START_REPLICATION command accepts an LSN parameter, but according to the
documentation, replication "starts at either WAL location XXX/XXX or the
slot's confirmed_flush_lsn (see Section 54.19), whichever is greater." I
wonder if there is a way to go back, like a command to set
confirmed_flush_lsn to a previous LSN.
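
For concreteness, the command I'm referring to has this shape (slot name and
start LSN are placeholders; any trailing options depend on the output plugin):

START_REPLICATION SLOT my_slot LOGICAL 0/16B3748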

I am asking because I may need to replay the replication messages to Kafka.

Sincerely,
Xiong Ding


PG17 optimizations to vacuum

2024-09-01 Thread Pavel Luzanov

Hello,

While playing with optimizations to vacuum in v17 I can't understand
how to measure this one:
"Allow vacuum to more efficiently remove and freeze tuples".

My test script and results:

CREATE TABLE t(id integer) WITH (autovacuum_enabled = off);
INSERT INTO t SELECT gen.id FROM generate_series(1,3_500_000) gen(id);
CREATE INDEX t_id ON t(id);
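-- deliberately small, to force multiple 'vacuuming indexes' passes in v16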
SET maintenance_work_mem = '1MB';
UPDATE t SET id = id + 1;
VACUUM FREEZE VERBOSE t;


v16.4
INFO:  aggressively vacuuming "postgres.public.t"
INFO:  finished vacuuming "postgres.public.t": index scans: 21
pages: 0 removed, 30974 remain, 30974 scanned (100.00% of total)
tuples: 3500000 removed, 3500000 remain, 0 are dead but not yet removable
removable cutoff: 1675, which was 0 XIDs old when operation ended
new relfrozenxid: 1675, which is 4 XIDs ahead of previous value
frozen: 15488 pages from table (50.00% of total) had 3500000 tuples frozen
index scan needed: 15487 pages from table (50.00% of total) had 3500000 dead 
item identifiers removed
index "t_id": pages: 19196 in total, 0 newly deleted, 0 currently deleted, 0 
reusable
avg read rate: 473.207 MB/s, avg write rate: 92.511 MB/s
buffer usage: 212718 hits, 267930 misses, 52380 dirtied
WAL usage: 96585 records, 42819 full page images, 198029405 bytes
system usage: CPU: user: 3.17 s, system: 0.48 s, elapsed: 4.42 s
VACUUM

master
INFO:  aggressively vacuuming "postgres.public.t"
INFO:  finished vacuuming "postgres.public.t": index scans: 1
pages: 0 removed, 30974 remain, 30974 scanned (100.00% of total)
tuples: 3500000 removed, 3500000 remain, 0 are dead but not yet removable
removable cutoff: 950, which was 0 XIDs old when operation ended
new relfrozenxid: 950, which is 4 XIDs ahead of previous value
frozen: 15488 pages from table (50.00% of total) had 3500000 tuples frozen
index scan needed: 15487 pages from table (50.00% of total) had 3500000 dead 
item identifiers removed
index "t_id": pages: 19196 in total, 0 newly deleted, 0 currently deleted, 0 
reusable
avg read rate: 101.121 MB/s, avg write rate: 120.530 MB/s
buffer usage: 48900 hits, 47749 reads, 56914 dirtied
WAL usage: 125391 records, 46626 full page images, 330547751 bytes
system usage: CPU: user: 2.90 s, system: 0.27 s, elapsed: 3.68 s
VACUUM


I see a perfectly working TID-store optimization.
With reduced maintenance_work_mem it used only one 'vacuuming indexes'
phase instead of 21 in v16.
But I also expected to see a reduction in the number of WAL records
and the total size of the WAL. Instead, WAL numbers have significantly
degraded.

What am I doing wrong?

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com