Partitioning and unique key

2024-08-31 Thread veem v
Hello,

We have our transaction tables daily range partitioned based on
transaction_timestamp column which is timestamptz data type and these are
having composite primary key on (transaction_id, transaction_timestamp).
And we were using an "insert on conflict" for loading data to our system ,
which means if another record comes to the system with the same
transaction_id and transaction_timestamp, it will get updated. This way we
already have 60 days worth of data stored in our system with approx. 70
million transactions per day.

But we just got to know from business that the data should be unique by
only transaction_id but not transaction_timestamp. Any incoming data with
the same transaction_id(even different transaction_timestamp) should get
updated but not inserted.

Also these daily partitions are going to hold 400million rows in future and
will be queried on the transaction_timestamp filter so we can't really
avoid the partitioning option here considering future growth.

But due to postgres limitations we are unable to have this unique
constraint or primary key only on the transaction_id column, we have to
include transaction_timestamp with it as a composite key. So I want to
understand from experts if there is any possible way to satisfy both
partitioning on transaction_timestamp column and unique key or pk just on
*trans*action_id only?

Note-its 15.4 postgres database.


Regards

Veem


Re: Upgrade Ubuntu 22 -> 24 may break PostgreSQL

2024-08-31 Thread Adrian Klaver

On 8/31/24 09:54, Peter J. Holzer wrote:

'Tis the season again.

Ubuntu 24.04.1 has just been released, so many Ubuntu LTS users will now
be prompted to upgrade from 22.04 to 24.04.


Which I ignore.



A word of warning to those who use Postgresql from the Ubuntu repo (not
PGDG):


Why I do use the PGDG repo's.



As usual, a newer Ubuntu version comes with a newer Postgres version (16
instead of 14). Also as usual, I got a message during the upgrade that
Postgres 14 is obsolete,. but the binaries have been left installed and
I should upgrade to Postgres 16 manually ASAP.

But after the reboot, PostgreSQL failed to start because it needed a
shared library (libldap) which was no longer there. So a normal
pg_upgradecluster wouldn't work.


sudo apt install libldap-X.x did not work?



In my case the quickest way to recover was to install postgresql-14 on a
VM, copy the data direcory into that instance and make a fresh dump,
then install postgresql-16 on my laptop and restore the dump. Annoying,
but no big deal for the small test database I keep on my laptop. If you
have multi-terabyte databases, your situation may be different.

I'm not exactly sure what went wrong (I got some conflicts during the
upgrade and maybe I shouldn't have invoked apt autoremove?), and you may
not have this problem, but make sure you have a backup before the
upgrade.

 hp




--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Partitioning and unique key

2024-08-31 Thread David G. Johnston
On Saturday, August 31, 2024, veem v  wrote:

>
> 1) if it's technically possible to have a unique key on only the
> transaction_id column having the partition key on the
> transaction_timestamp, because the table is going to be queried/purged
> based on the transaction_timestamp?
>

There is presently no such thing as a cross-partition unique constraint.
If you define the constraint on the [partitioned] table the documentation
is perfectly clear, as are I believe the error messages, that it will
require all partitioning columns to be included - since that is what
happens in reality.  If you target the partitions directly with the unique
index or constraint no such limitation should exist.


>
> iii)And then alter the datatype of the partition key transaction_date to
> DATE in one shot at the table level(which should be fast as its having more
> granularity as compare to existing timestamptype, so should be catalog or
> dictionary change only), and that will remain the part of composite PK
> (transaction_id,transaction_date).
>

While this might seem logical, in reality date and timestamptz are
different fixed-width data types and thus any attempt to change from one to
the other will involve a table rewrite.  Best you could do is leave the
timestamptz in place and just truncate to day so the time is always
midnight UTC.


> iv) Repeat this step for all child partition tables and then for the
> parent partition tables.
>

I’d suggest trying to just build a new partitioned table that is correctly
defined. Then populate it.  Add a trigger to the existing one to keep the
new one in sync.  Then change your application code to point to the new
partitioned table.  At which point the old partitioned table can be dropped.

David J.


Re: Upgrade Ubuntu 22 -> 24 may break PostgreSQL

2024-08-31 Thread Justin Clift

On 2024-09-01 02:54, Peter J. Holzer wrote:

'Tis the season again.

Ubuntu 24.04.1 has just been released, so many Ubuntu LTS users will 
now

be prompted to upgrade from 22.04 to 24.04.

A word of warning to those who use Postgresql from the Ubuntu repo (not
PGDG):

As usual, a newer Ubuntu version comes with a newer Postgres version 
(16

instead of 14). Also as usual, I got a message during the upgrade that
Postgres 14 is obsolete,. but the binaries have been left installed and
I should upgrade to Postgres 16 manually ASAP.


It'd *technically* be possible to automatically run an upgrade of the
PostgreSQL repository (via scripting?) at launch time, though just 
blindly

doing it for everyone would be a *major* change of behaviour.

Some people would likely love it, while others would be horrified (etc).

That being said, if we announce it ahead of time as a feature of a major
release (ie PG 18 or something), and if we have a clear way to not
automatically upgrade (a variable in postgresql.conf?), then we might be
able to solve this problem ~permanently.

We'd also need to figure out how to handle (say) rebuilding of indexes
that need updating between major versions and stuff like that.

Thoughts?

Regards and best wishes,

Justin Clift




Upgrade Ubuntu 22 -> 24 may break PostgreSQL

2024-08-31 Thread Peter J. Holzer
'Tis the season again.

Ubuntu 24.04.1 has just been released, so many Ubuntu LTS users will now
be prompted to upgrade from 22.04 to 24.04.

A word of warning to those who use Postgresql from the Ubuntu repo (not
PGDG):

As usual, a newer Ubuntu version comes with a newer Postgres version (16
instead of 14). Also as usual, I got a message during the upgrade that
Postgres 14 is obsolete,. but the binaries have been left installed and
I should upgrade to Postgres 16 manually ASAP.

But after the reboot, PostgreSQL failed to start because it needed a
shared library (libldap) which was no longer there. So a normal
pg_upgradecluster wouldn't work.

In my case the quickest way to recover was to install postgresql-14 on a
VM, copy the data direcory into that instance and make a fresh dump,
then install postgresql-16 on my laptop and restore the dump. Annoying,
but no big deal for the small test database I keep on my laptop. If you
have multi-terabyte databases, your situation may be different.

I'm not exactly sure what went wrong (I got some conflicts during the
upgrade and maybe I shouldn't have invoked apt autoremove?), and you may
not have this problem, but make sure you have a backup before the
upgrade.

hp


-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature


Re: Partitioning and unique key

2024-08-31 Thread veem v
On Sun, 1 Sept 2024 at 03:58, Adrian Klaver 
wrote:

>
> The model is at odds with itself and untenable. If the tables hold
> multiple rows for a given transaction_id  then you cannot have a
> PK/Unique constraint on that column. Seems there is a decided lack of
> any planning. The only way I can see this happening is consolidating all
> the duplicate transaction_id rows into a single row for each
> transaction_id. That then leads to the question of how to do that and
> retain the 'correct' information from the selection of rows for each
> transaction_id.
>
>
Yes we had messed up the data for now and have multiple records for each
transaction_id persisted and thus we need to fix the data. But more than
that , as I stated , I wanted to understand first
1) if it's technically possible to have a unique key on only the
transaction_id column having the partition key on the
transaction_timestamp, because the table is going to be queried/purged
based on the transaction_timestamp?

2) Additionally we were thinking if above is technically not possible, then
the maximum granularity which we can have for each transaction_id will be a
day, so the partition key transaction_timestmp can be truncated to have
only date component but no time component. So the primary key will
be (transaction_id, transaction_date). But we also don't want to lose the
time component and persist the existing data of transaction_timestmp (which
will have a time component in it, in a separate column).

And in above case , for fixing the existing data in least disruptive way,
as we have currently duplicate transaction_id inserted into the table
already because of the composite primary key(transaction_id,
transaction_timestmp).Can we simply
 i)rename the existing column transaction_timestmp to transaction_date and
then add new column transaction_timestmp using the values of existing
column partition by partition.
ii)And then delete the duplicate data using query something as below , each
partition by partition.
iii)And then alter the datatype of the partition key transaction_date to
DATE in one shot at the table level(which should be fast as its having more
granularity as compare to existing timestamptype, so should be catalog or
dictionary change only), and that will remain the part of composite PK
(transaction_id,transaction_date).
iv) Repeat this step for all child partition tables and then for the parent
partition tables.

Will this technique be the most efficient way of fixing this mess?

WITH ranked_records AS (
SELECT column1_id, column2_timestamptz,
   ROW_NUMBER() OVER (PARTITION BY column1_id,
date_trunc('day', column2_timestamptz)
  ORDER BY column2_timestamptz DESC) AS
rn
FROM partition_name
)
DELETE FROM partition_name T1
WHERE EXISTS (
SELECT 1
FROM ranked_records T2
WHERE T1.column1_id = T2.column1_id
  AND T1.column2_timestamptz = T2.column2_timestamptz
  AND T2.rn > 1
)


Re: Partitioning and unique key

2024-08-31 Thread veem v
On Sun, 1 Sept 2024 at 09:13, David G. Johnston 
wrote:

> On Saturday, August 31, 2024, veem v  wrote:
>
>>
>>
>> iii)And then alter the datatype of the partition key transaction_date to
>> DATE in one shot at the table level(which should be fast as its having more
>> granularity as compare to existing timestamptype, so should be catalog or
>> dictionary change only), and that will remain the part of composite PK
>> (transaction_id,transaction_date).
>>
>
> While this might seem logical, in reality date and timestamptz are
> different fixed-width data types and thus any attempt to change from one to
> the other will involve a table rewrite.  Best you could do is leave the
> timestamptz in place and just truncate to day so the time is always
> midnight UTC.
>
>
>>
>>
Here  , if we keep the PK column as is i.e. the transaction_timestamp as
timestamptz but truncate the time component , in that case again in future
if someone tries to insert(using insert on conflict) data into the table
with time component , it will get consumed and will not be restricted by
the PK constraint. So I was trying to make the data type also as DATE for
the transaction_timestap column.

As in this case anyway we have to create another column to populate the
date+timestamp values as we cant throw those values away per business need,
so we will be kind of rewriting the table.So is it okay if if we will

1) Detach all the partitions.
2)Do the alter using "only" key word in table level. (For adding new column
transaction_timestamp_new  to hold date+timestamp value and also altering
the existing transaction_timestamp column to DATE from type timestamptz).
3)Then do the data fix(delete the duplicates) and alter the column, one
partition at a time for all of the partitions and once done , attach  those
partitions one by one.
5)Rename the columns at table level.Hope this won't need any table rewrite.

Is there any downside if we go by the above approach?


Re: Partitioning and unique key

2024-08-31 Thread Adrian Klaver

On 8/31/24 13:02, veem v wrote:

Hello,

We have our transaction tables daily range partitioned based on 
transaction_timestamp column which is timestamptz data type and these 
are having composite primary key on (transaction_id, 
transaction_timestamp). And we were using an "insert on conflict" for 
loading data to our system , which means if another record comes to the 
system with the same transaction_id and transaction_timestamp, it will 
get updated. This way we already have 60 days worth of data stored in 
our system with approx. 70 million transactions per day.


But we just got to know from business thatthe data should be unique by 
only transaction_id but not transaction_timestamp. Any incoming data 
with the same transaction_id(even different transaction_timestamp) 
should get updated but not inserted.


Also these daily partitions are going to hold 400million rows in future 
and will be queried on the transaction_timestamp filter so we can't 
really avoid the partitioning option here considering future growth.


But due to postgres limitations we are unable to have this unique 
constraint or primary key only on the transaction_id column, we have to 
include transaction_timestamp with it as a composite key. So I want to 
understand from experts if there is any possible way to satisfy both 
partitioning on transaction_timestamp column and unique key or pk just 
on _trans_action_id only?


The model is at odds with itself and untenable. If the tables hold 
multiple rows for a given transaction_id  then you cannot have a 
PK/Unique constraint on that column. Seems there is a decided lack of 
any planning. The only way I can see this happening is consolidating all 
the duplicate transaction_id rows into a single row for each 
transaction_id. That then leads to the question of how to do that and 
retain the 'correct' information from the selection of rows for each 
transaction_id.




Note-its 15.4 postgres database.


Regards

Veem



--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Partitioning and unique key

2024-08-31 Thread veem v
On Sun, 1 Sept 2024 at 10:03, veem v  wrote:

>
> On Sun, 1 Sept 2024 at 09:13, David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>
>> On Saturday, August 31, 2024, veem v  wrote:
>>
>>>
>>>
>>> iii)And then alter the datatype of the partition key transaction_date to
>>> DATE in one shot at the table level(which should be fast as its having more
>>> granularity as compare to existing timestamptype, so should be catalog or
>>> dictionary change only), and that will remain the part of composite PK
>>> (transaction_id,transaction_date).
>>>
>>
>> While this might seem logical, in reality date and timestamptz are
>> different fixed-width data types and thus any attempt to change from one to
>> the other will involve a table rewrite.  Best you could do is leave the
>> timestamptz in place and just truncate to day so the time is always
>> midnight UTC.
>>
>>
>>>
>>>
> Here  , if we keep the PK column as is i.e. the transaction_timestamp as
> timestamptz but truncate the time component , in that case again in future
> if someone tries to insert(using insert on conflict) data into the table
> with time component , it will get consumed and will not be restricted by
> the PK constraint. So I was trying to make the data type also as DATE for
> the transaction_timestap column.
>
> As in this case anyway we have to create another column to populate the
> date+timestamp values as we cant throw those values away per business need,
> so we will be kind of rewriting the table.So is it okay if if we will
>
> 1) Detach all the partitions.
> 2)Do the alter using "only" key word in table level. (For adding new
> column transaction_timestamp_new  to hold date+timestamp value and also
> altering the existing transaction_timestamp column to DATE from type
> timestamptz).
> 3)Then do the data fix(delete the duplicates) and alter the column, one
> partition at a time for all of the partitions and once done , attach  those
> partitions one by one.
> 5)Rename the columns at table level.Hope this won't need any table rewrite.
>
> Is there any downside if we go by the above approach?
>

Or do you mean to say there is no way we can modify the data type of a
partition key even by detaching the partitions one by one? And thus we may
have only way left is to create the table from scratch with partitions and
populate the data to it? I was avoiding this because we have many indexes
also in it , so creating from scratch means creating those indexes again.
So I wanted to achieve it by detaching partitions, doing the required
change and attaching it again.


Re: Upgrade Ubuntu 22 -> 24 may break PostgreSQL

2024-08-31 Thread Peter J. Holzer
On 2024-08-31 10:35:01 -0700, Adrian Klaver wrote:
> On 8/31/24 09:54, Peter J. Holzer wrote:
> > 'Tis the season again.
> > 
> > Ubuntu 24.04.1 has just been released, so many Ubuntu LTS users will now
> > be prompted to upgrade from 22.04 to 24.04.
> 
> Which I ignore.

At some point I do want to upgrade. Not necessarily at the earliest
oportunity, but for workstations/laptops I'm usually rather quick. This
laptop was originally delivered with a pre-installed Ubuntu 18.04 LTS,
and has since been upgraded to 20.04, 22.04 and now 24.04. Servers are
different. They are usually upgraded when neccessary. Or sometimes new
hardware is ready before support ends ...

But I think our personal preferences are besides the point. Many people
will upgrade Ubuntu in the next weeks or months. So I wanted to give
them a heads-up that this might not be as smooth as expected.


> > A word of warning to those who use Postgresql from the Ubuntu repo (not
> > PGDG):
> 
> Why I do use the PGDG repo's.

So do I on production systems. But this was just my personal laptop.
But yeah, it reminded me that one of the reasons I prefer the PGDG repos
is that they decouple PostgreSQL upgrades from OS upgrades.


> > As usual, a newer Ubuntu version comes with a newer Postgres version (16
> > instead of 14). Also as usual, I got a message during the upgrade that
> > Postgres 14 is obsolete,. but the binaries have been left installed and
> > I should upgrade to Postgres 16 manually ASAP.
> > 
> > But after the reboot, PostgreSQL failed to start because it needed a
> > shared library (libldap) which was no longer there. So a normal
> > pg_upgradecluster wouldn't work.
> 
> sudo apt install libldap-X.x did not work?

It might have worked after reenabling the Ubuntu 22 repo. Or it might
have worked to simply symlink to the newer libldap. Frankly I didn't
try. I had a VM with ubuntu 22 without postgres lying around, so that
seemed like the safeste route.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature