Vacuum Questions

2025-05-02 Thread Leo
I have been working with AWS PostgreSQL RDS for a few years, but I am still
not very experienced when it comes to performance issues.  Plus, RDS is
slightly different from pure PostgreSQL.

I am trying to comprehend exactly how vacuum works.

Here is what I am trying to do.

I am purging old records from a table (500 million rows total; I am doing it
in sets of 50,000,000, with an inner loop deleting 100,000 rows at a time).
That works just fine.
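
Roughly, the purge loop looks like the sketch below (simplified; the date
column name is just an illustration and the real script differs):

DO $$
DECLARE
  batch_rows bigint;
  total_rows bigint := 0;
BEGIN
  LOOP
    -- delete one small batch of expired rows
    DELETE FROM pc_workflowlog
    WHERE ctid IN (SELECT ctid FROM pc_workflowlog
                   WHERE createtime < now() - interval '6 months'  -- hypothetical column
                   LIMIT 100000);
    GET DIAGNOSTICS batch_rows = ROW_COUNT;
    total_rows := total_rows + batch_rows;
    -- stop after one 50,000,000-row set (or when nothing is left), then vacuum
    EXIT WHEN batch_rows = 0 OR total_rows >= 50000000;
    COMMIT;  -- PG 11+: commit each 100,000-row batch in its own transaction
  END LOOP;
END
$$;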

Because of the amount of data/rows deleted, I disabled autovacuum for
this table (I want to have control over vacuum; autovacuum does not
complete anyway due to the timeout, sizing, etc. settings that I do not want
to change system-wide).  I will turn autovacuum back on once I am done, of
course.
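
For reference, the per-table way to do that is roughly:

ALTER TABLE pc_workflowlog SET (autovacuum_enabled = false);
-- ... purge and manual VACUUM ...
ALTER TABLE pc_workflowlog RESET (autovacuum_enabled);  -- turn it back on afterwards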

The issue is when I start vacuuming.  This table has 4 indexes and a PK
that I worry about.  The PK takes about 30 minutes to vacuum and two of the
indexes take about an hour each.  The problem comes in for the other 2
indexes - they take 12+ hours each to vacuum:

psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  vacuuming
"public.pc_workflowlog"

psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  launched 4 parallel vacuum
workers for index vacuuming (planned: 4)

psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
"pc_workflowlog_pk" to remove 5000 row versions

DETAIL:  CPU: user: 191.03 s, system: 12.43 s, elapsed: 1711.22 s

psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
"workflowlo_u_publicid_g6uqp9lkn6e8" to remove 5000 row versions

DETAIL:  CPU: user: 325.75 s, system: 19.75 s, elapsed: 2674.24 s

psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
"workflowlo_n_workflow_2tc9k2hdtry9v" to remove 5000 row versions

DETAIL:  CPU: user: 312.17 s, system: 16.94 s, elapsed: 3097.88 s

*psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
"workflowlo_n_frozenseti_2kjkbjgf3c6ro" to remove 5000 row versions*

*DETAIL:  CPU: user: 41187.70 s, system: 216.14 s, elapsed: 42749.36 s*

*psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  scanned index
"workflowlo_n_userid_14kqw6qdsnndw" to remove 5000 row versions *

*DETAIL:  CPU: user: 41280.66 s, system: 216.67 s, elapsed: 42832.16 s*

psql:/home/backup/leo/fws_vacuum.sql:6: INFO:  "pc_workflowlog": removed
5000 row versions in 1129870 pages

I've increased max_parallel_maintenance_workers to 8 for the session, and
vacuum used 4 parallel workers (one per index, I assume); the two smaller
indexes were done in about an hour each.  What I am trying to figure out is
how to get the other two large indexes vacuumed in parallel as well - several
workers going against a single index.  It seems this should be possible, the
index size is large enough to qualify, but I have not been able to figure it
out yet.  Most of the parameters are at default values.
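
For reference, the vacuum is run along these lines (a sketch; the actual
fws_vacuum.sql may differ):

SET max_parallel_maintenance_workers = 8;
VACUUM (VERBOSE) pc_workflowlog;
-- or, on PostgreSQL 13+, the degree can be requested explicitly:
-- VACUUM (VERBOSE, PARALLEL 8) pc_workflowlog;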

What am I missing?

I have a few other questions.  Does vacuum time depend only on the number of
dead rows and the size of the table, or does the entire storage allocation
(including dead tuples) also affect it?

Would it make more sense to drop the two large indexes, purge, vacuum,
and recreate the indexes afterwards (I know it needs to be tested)?  The
reason I am doing it in stages is to make sure I have enough time to vacuum,
but maybe it would not take much longer to vacuum after the complete purge?

Lastly, is it better to delete all the rows (500 mil) instead of doing it
in smaller batches, and vacuum only once?

The current size of the table is about 1 TB, and the indexes add another
1.5 TB to it.

Truncate is not an option as I am only deleting rows older than 6 months.
The client had not been purging for years, but will do so after this cleanup.

P.S. This is my very first post here, please advise if it is the wrong
channel.  Thank you in advance.


Re: Vacuum Questions

2025-05-02 Thread John Naylor
On Fri, May 2, 2025 at 9:23 PM Leo  wrote:

> I am purging old records from a table (500 million rows, but I am doing it in 
> sets of  50,000,000 with a smaller loop of 100,000).  That works just fine.
>
> Because of the amount of data/rows deleted, I disabled the autovacuum for 
> this table (I want to have control over vacuum, autovacuum does not complete 
> anyway due to the timeout, sizing, etc settings that I do not want to change 
> system wide).  I will put the autovacuum back once I am done of course.
>
> The issue is when I start vacuuming.  This table has 4 indexes and a PK that 
> I worry about.  The PK takes about 30 minutes to vacuum and two of the 
> indexes take about an hour each.  The problem comes in for the other 2 
> indexes - they take 12+ hours each to vacuum:

Do you know offhand anything special about these two? Do they have a
random key like UUID or a hash calculation? That can make index
vacuuming slower, but 12x still seems abnormal to me.
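
If it's not obvious from the schema, something like this shows the
definitions:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'pc_workflowlog';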

> I've increased max_parallel_maintenance_workers to 8 for the session and it 
> used parallel 4 (one for each index I assume) to handle it and the two 
> indexes were done in ~ an hour.

Right, one worker scans one index, simultaneously with other workers.
Here the leader process launched 4 workers and also vacuumed one index
itself.

> What I am trying to figure out is how to force the other two large indexes to 
> be vacuumed in parallel - a few workers going against an index.  It seems it 
> is possible to do, the index size is large enough to kick in, but I have not 
> been able to figure it out yet.  Most of the parameters are at default values.

It is not possible to run multiple workers on a single index.

> I have a few other questions.  Does vacuum time depend on the number of dead 
> rows only and the size of the table, or does the entire storage allocation 
> (including dead tuples) also affect it?

The vacuum time depends largely on
1) The number of heap (=table) and index pages that vacuum writes to,
since that correlates with WAL volume -- the DELETE query affects how
the partial deletes are spread across the table. Ideally each page has
its records deleted only once, so it is not dirtied again by a later
batch.
2) The number and size of the indexes, since they must be scanned in
their entirety
3) The storage allocation -- What version is this and what is
maintenance_work_mem set to? This affects how many times each index
must be vacuumed.
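
For example, both can be checked quickly, and maintenance_work_mem can be
raised just for the session that runs the manual vacuum:

SHOW server_version;
SHOW maintenance_work_mem;
SET maintenance_work_mem = '1GB';  -- session-level; only affects this connection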

> Would it make more sense to drop the two large indexes, purge, vacuum, and 
> recreate the indexes afterwards (I know it needs to be tested)?  The reason 
> I am doing it in stages is to make sure I have enough time to vacuum, but 
> maybe it would not take much longer to vacuum after the complete purge?

Could you explain your operational constraints? In my experience, WAL
volume has been a more pressing concern for clients in these cases
than elapsed time, but your needs may vary. One possible concern is
that the indexes that take a long time to vacuum may also take a long
time to recreate. Aside from that, dropping and recreating indexes may
be a good way to speed this up, depending on how critical they are for
queries.
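
As a rough sketch of that approach (the column list below is hypothetical;
pull the real definition from pg_indexes first):

DROP INDEX workflowlo_n_frozenseti_2kjkbjgf3c6ro;
-- ... run the purge and the vacuum ...
CREATE INDEX CONCURRENTLY workflowlo_n_frozenseti_2kjkbjgf3c6ro
    ON pc_workflowlog (frozensetid);  -- hypothetical column

CREATE INDEX CONCURRENTLY avoids blocking writes during the rebuild, at the
cost of taking longer than a plain CREATE INDEX.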

> Lastly, is it better to delete all the rows (500 mil) instead of doing it in 
> smaller batches, and vacuum only once?

On PG16 and earlier, this would still require each index to be
vacuumed 3 times due to memory allocation constraints, so it would be
similar to 3 batches. The drop/recreate strategy would still help in
that case.
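
(Rough arithmetic, assuming maintenance_work_mem is at or above the 1 GB
ceiling those versions impose on the dead-TID array: 500 million dead tuples
at 6 bytes per TID is about 3 GB, so the array fills roughly three times, and
every index is scanned once per fill.)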

> The current size of the table is about 1T and the indexes add another 1.5T to 
> it.
>
> Truncate is not an option as I am only deleting rows older than 6 months.  
> Client was not doing purging for years, but will do it after the clean up.

What percentage of the table do the 500 million deletes represent?
If you're deleting the vast majority of the records and have the extra
disk space to spare, you might consider VACUUM FULL after completing
the deletes -- that would rewrite the table and recreate all indexes.
That's not generally recommended, since that locks the table for the
entire duration and has other disadvantages, but it is an option.
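
For completeness, that would simply be the following; keep in mind it holds
an ACCESS EXCLUSIVE lock on the table for the whole run:

VACUUM (FULL, VERBOSE) pc_workflowlog;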

For your follow-up question:

> Also, is there a way to estimate the vacuum execution?  Something like 
> explain plan - without actually vacuuming, just to see how it will perform it 
> - like a degree of parallelism?

There is no way to ask the system to estimate the runtime or other
resource usage.

Also for future reference, please note that we discourage top-posting
(quoting an entire message in a reply).

--
John Naylor
Amazon Web Services




Re: Vacuum Questions

2025-05-02 Thread Leo
Also, is there a way to estimate the vacuum execution?  Something like
explain plan - without actually vacuuming, just to see how it will perform
it - like a degree of parallelism?
