Thoughts on how to avoid a massive integer update.

2020-05-04 Thread Fehrle, Brian
Hi all,

This is a shot in the dark in the hope of finding a magic bullet for an issue I 
have; I can’t think of any solution myself.

I have a database with hundreds of terabytes of data, where every table has an 
integer column referencing a small table. For reasons outside of my control 
that I cannot change, I NEED to update every single row in all these tables, 
changing the integer value to a different integer.

Since I have to deal with dead space, I can only do a couple of tables at a 
time, then do a vacuum full after each one.
Another option is to build a new table with the new values, then drop the old 
one and swap in the new (sketched below). Either way is very time consuming.
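
For concreteness, a minimal sketch of the rebuild-and-swap option (every name 
here is a placeholder, and the small old-to-new mapping table is an invented 
illustration, not part of the real schema):

    -- build the replacement table with the remapped values
    CREATE TABLE some_table_new AS
      SELECT t.id,
             m.new_value AS ref_value   -- remap via a small old->new lookup
        FROM some_table t
        JOIN value_map m ON m.old_value = t.ref_value;

    -- swap, briefly holding an exclusive lock
    BEGIN;
    DROP TABLE some_table;
    ALTER TABLE some_table_new RENAME TO some_table;
    COMMIT;

Indexes, constraints, and permissions aren’t copied by CREATE TABLE AS, so they 
have to be re-created on the new table as part of the swap.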

Initial tests suggest this effort will take several months to complete, not to 
mention causing blocking issues on the tables being worked on.

Does anyone have any hackery ideas on how to achieve this in less time? I was 
looking at possibly converting the integer column to another type that would 
present the integer differently, such as a hex value, but everything still ends 
up requiring all the data to be re-written to disk. In a well designed database 
(I didn’t design it :) ), I would simply change the data in the referenced 
table (200 total rows); however, the key being referenced isn’t just an 
arbitrary ID, it’s actual ‘data’, and must be changed.
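
To illustrate why the type-conversion idea doesn’t help: even a 
value-preserving change like widening the integer forces PostgreSQL to rewrite 
the whole table (placeholder names):

    ALTER TABLE some_table ALTER COLUMN ref_id TYPE bigint;
    -- bigint is not binary-compatible with integer,
    -- so every row is rewritten on disk anyway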

Thanks for any thoughts or ideas,

Brian F


Re: Thoughts on how to avoid a massive integer update.

2020-05-08 Thread Fehrle, Brian


On 5/4/20, 3:56 PM, "Adrian Klaver"  wrote:



On 5/4/20 2:32 PM, Fehrle, Brian wrote:
> Hi all,
>
> This is a shot in the dark in the hope of finding a magic bullet for an
> issue I have; I can’t think of any solution myself.
>
> I have a database with hundreds of terabytes of data, where every table
> has an integer column referencing a small table. For reasons outside of
> my control that I cannot change, I NEED to update every single row in
> all these tables, changing the integer value to a different integer.
>
> Since I have to deal with dead space, I can only do a couple of tables
> at a time, then do a vacuum full after each one.

Why?
A regular vacuum would mark the space as available.

A regular vacuum would mark the space as available – ***for re-use***; it will 
not release the space back to the drive as ‘unused’. 99% of my tables are old 
data that will not receive any future inserts or updates, which means that 
space marked ready for re-use will never actually be used.
So a 100GB table gets updated, every row is marked dead and re-created with the 
newly updated data, and the table is now 200GB in size. A plain vacuum keeps it 
at 200GB of space used, freeing the 100GB of dead space as ready-to-reuse. A 
vacuum full makes it 100GB again.

Since 99% of my tables will never be updated or inserted into again, this means 
my ~300 terabytes of data would become ~600 terabytes on disk. Thus, vacuum 
full (see the sketch below).
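
In other words, the per-table sequence I’m stuck with looks roughly like this 
(generic names, with <old> and <new> standing in for the real values):

    UPDATE some_table SET ref_id = <new> WHERE ref_id = <old>;
    -- every updated row now also exists as a dead tuple, doubling the table
    VACUUM FULL some_table;
    -- rewrites the table compactly and returns the freed space to the OS,
    -- but holds an ACCESS EXCLUSIVE lock for the duration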

More below.

> Another option is to build a new table with the new values, then drop
> the old one and swap in the new. Either way is very time consuming.
>
> Initial tests suggest this effort will take several months to complete,
> not to mention causing blocking issues on the tables being worked on.
>
> Does anyone have any hackery ideas on how to achieve this in less time?
> I was looking at possibly converting the integer column to another type
> that would present the integer differently, such as a hex value, but
> everything still ends up requiring all the data to be re-written to
> disk. In a well designed database (I didn’t design it :) ), I would
> simply change the data in the referenced table (200 total rows);
> however, the key being referenced isn’t just an arbitrary ID, it’s
> actual ‘data’, and must be changed.

I'm not following above.

Could you show an example table relationship?

It’s a simple one-to-many relationship, roughly:

    CREATE TABLE info_table (
        info_table_sid integer PRIMARY KEY
    );

    CREATE TABLE data_table (
        data_table_sid integer PRIMARY KEY,
        info_table_id  integer REFERENCES info_table (info_table_sid)
        -- ...plus the actual data columns
    );
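
With that foreign key in place, a direct update of the referenced key fails 
while rows still point at it (a sketch with illustrative values), which is why 
every referencing row has to be rewritten:

    UPDATE info_table SET info_table_sid = 456 WHERE info_table_sid = 123;
    -- ERROR:  update or delete on table "info_table" violates foreign key
    --         constraint "..." on table "data_table"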




>
> Thanks for any thoughts or ideas,
>
>   * Brian F
>


--
Adrian Klaver
adrian.kla...@aklaver.com




Re: Thoughts on how to avoid a massive integer update.

2020-05-08 Thread Fehrle, Brian


From: "David G. Johnston" 
Date: Friday, May 8, 2020 at 11:48 AM
To: "Fehrle, Brian" 
Cc: "pgsql-gene...@postgresql.org" 
Subject: Re: Thoughts on how to avoid a massive integer update.


On Mon, May 4, 2020 at 2:33 PM Fehrle, Brian <bfeh...@comscore.com> wrote:
I NEED to update every single row in all these tables, changing the integer 
value to a different integer.

Does anyone have any hackery ideas on how to achieve this in less time?

Probably the only solution that would perform computationally faster would take 
the same amount of time or more to code and debug, and be considerably riskier: 
basically, shut down PostgreSQL and modify the data files directly to change 
one integer byte sequence to another. On the positive side, the source code for 
PostgreSQL is open and that data, while complex, is structured.

On the point of "vacuum" versus "vacuum full" - I don't know if this is how it 
would work in reality, but conceptually: if you updated half the table, 
vacuumed, updated the second half, and vacuumed again, the second batch of 
updates would reuse the space freed from the first batch, and you'd only 
increase the disk consumption by 1.5x instead of 2.0x. As you increase the 
number of batches, the percentage of additional space consumed decreases. 
Though if you have the space, I'd have to imagine that creating a brand new 
table and dropping the old one would be the best solution taken in isolation.
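
A minimal sketch of that batching idea against the example schema from earlier 
in the thread (the batch bounds and the 123 -> 456 values are purely 
illustrative):

    UPDATE data_table SET info_table_id = 456
     WHERE info_table_id = 123
       AND data_table_sid BETWEEN 1 AND 50000000;
    VACUUM data_table;  -- batch 1's dead tuples become reusable space
    UPDATE data_table SET info_table_id = 456
     WHERE info_table_id = 123
       AND data_table_sid BETWEEN 50000001 AND 100000000;
    -- batch 2's new rows land in the space batch 1 freed
    VACUUM data_table;
    -- ...and so on; peak bloat is roughly (table size / number of batches)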

David J.


Modifying data files is too risky, and I wouldn’t be able to get that kind of 
work approved.

Even keeping the excess space to an additional 50%, that’s tons of storage I’d 
need to order, so either vacuum full or re-creating tables for minimal on-disk 
usage are my only options.


Re: Thoughts on how to avoid a massive integer update.

2020-05-08 Thread Fehrle, Brian


From: Rob Sargent 
Date: Friday, May 8, 2020 at 11:05 AM
To: "Fehrle, Brian" 
Cc: Adrian Klaver , "pgsql-gene...@postgresql.org" 

Subject: Re: Thoughts on how to avoid a massive integer update.


   Could you show an example table relationship?

It’s a simple one-to-many relationship, roughly:

    CREATE TABLE info_table (
        info_table_sid integer PRIMARY KEY
    );

    CREATE TABLE data_table (
        data_table_sid integer PRIMARY KEY,
        info_table_id  integer REFERENCES info_table (info_table_sid)
        -- ...plus the actual data columns
    );




Right, and now you wish to change the values in the referenced table 
(info_table.info_table_sid), correct?


Correct. If info_table.info_table_sid = 123 and many rows in data_table point 
to it, the ID needs to be changed to 456 in info_table, as well as in every 
referencing row in data_table.
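
Spelled out against the example schema, repointing a single value looks like 
this (a sketch; it assumes nothing else constrains the swap):

    BEGIN;
    INSERT INTO info_table (info_table_sid) VALUES (456);  -- new key first
    UPDATE data_table SET info_table_id = 456 WHERE info_table_id = 123;
    DELETE FROM info_table WHERE info_table_sid = 123;     -- now unreferenced
    COMMIT;

It’s that middle UPDATE, multiplied across hundreds of terabytes of referencing 
rows, that is the expensive part.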


Re: Thoughts on how to avoid a massive integer update.

2020-05-08 Thread Fehrle, Brian
David’s assessment is correct (and I think we’re all on the same page). The 
value of the foreign keys that tie the tables together must be changed, and 
yeah, that value _should_ simply be an arbitrary integer, with the real data in 
an additional column of the info_table, but since it wasn’t set up that way 
from the beginning (over a decade ago), this is what I’m stuck with.

Blah.


From: Rob Sargent 
Date: Friday, May 8, 2020 at 3:05 PM
To: "David G. Johnston" 
Cc: "Fehrle, Brian" , "pgsql-gene...@postgresql.org" 

Subject: Re: Thoughts on how to avoid a massive integer update.





On May 8, 2020, at 2:57 PM, David G. Johnston <david.g.johns...@gmail.com> 
wrote:

On Fri, May 8, 2020 at 1:51 PM Rob Sargent <robjsarg...@gmail.com> wrote:

On May 8, 2020, at 2:43 PM, David G. Johnston <david.g.johns...@gmail.com> 
wrote:

On Fri, May 8, 2020 at 1:41 PM Rob Sargent <robjsarg...@gmail.com> wrote:
My understanding is the keys in the info_table need to change. That causes the 
very expensive update in the data tables. No?

The keys in the info_table need to change because their contents are no longer 
legal to store (the OP has not specified, but think of using the integer value 
of someone's social security number as a key). The FK side of the relationship 
has the same illegal-data-values problem and needs to be changed too.

Wow, I couldn’t disagree more ;)

Your agreement or disagreement with the problem statement is immaterial here - 
the OP has stated the requirement, for which I have made a simplistic analogy 
in order to try to get the point across to you. As the OP has said, it is a 
poor design - and now it is being corrected. The request is whether there is 
some way to do so better than the two options the OP already described.

David J.

Sorry, I wasn’t disagreeing with the problem statement. The OP did say the 
“info.id” needed to change from 123 to 456. With the current foreign key 
alignment that is very expensive. I think we’re all in agreement there. To push 
“456” back out to the data table I see as perpetuating the problem. I didn’t 
sense that the OP felt it necessary to continue in the current mode as a 
requirement. If so, my mistake.