autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-16 Thread Jim Hurne
We have a cloud service that uses PostgreSQL to temporarily store binary 
content. We're using PostgreSQL's Large Objects to store the binary 
content. Each large object lives anywhere from a few hundred milliseconds 
to 5-10 minutes, after which it is deleted.

Normally, this works just fine and we have no issues. However, roughly 
every 3 months or so, disk usage on our PostgreSQL database starts to 
increase at an alarming rate. More specifically, the pg_largeobject 
table's disk usage steadily increases even though we've deleted most of 
the large objects. This will continue until the database server runs out 
of disk space (and then all sorts of bad things happen of course).

It looks to us like autovacuum against the pg_largeobject table stops 
working because the pg_stat_all_tables.last_autovacuum column stops 
updating, or is updated infrequently, while it is still updated on other 
tables:


 schemaname |         relname         | n_live_tup | n_dead_tup |        last_autovacuum
------------+-------------------------+------------+------------+-------------------------------
 pg_catalog | pg_largeobject          |      37205 |   92614852 | 2020-06-15 01:55:09.037101+00
 pg_toast   | pg_toast_973434724      |        281 |       3701 | 2020-06-15 17:22:54.746452+00
 pg_catalog | pg_largeobject_metadata |        320 |       1236 | 2020-06-15 17:23:49.765878+00
 pg_catalog | pg_shdepend             |        344 |        933 | 2020-06-15 17:23:54.448855+00
 pg_toast   | pg_toast_2619           |         52 |         85 | 2020-06-15 17:22:01.387933+00
 pg_catalog | pg_statistic            |        413 |        100 | 2020-06-15 17:15:52.656791+00
 pg_catalog | pg_class                |        349 |          2 | 2020-06-12 17:58:13.147931+00
 pg_catalog | pg_attribute            |       2633 |          1 | 2020-06-12 17:58:13.000114+00
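
(For reference, stats like the above can be pulled with a query along these 
lines; the exact query we used isn't reproduced here:)

  SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_all_tables
  ORDER BY n_dead_tup DESC
  LIMIT 8;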


Looking at our PostgreSQL logs, it looks like the autovacuum task takes 
increasingly more time to run (against the pg_largeobject table):

2020-06-12T19:41:58.335931494Z stderr F system usage: CPU: user: 0.02 s, system: 0.00 s, elapsed: 1.77 s
2020-06-12T19:42:59.704884752Z stderr F system usage: CPU: user: 0.02 s, system: 0.02 s, elapsed: 3.06 s
2020-06-12T19:44:01.928300512Z stderr F system usage: CPU: user: 0.06 s, system: 0.01 s, elapsed: 5.44 s
2020-06-12T19:45:14.124209167Z stderr F system usage: CPU: user: 0.11 s, system: 0.05 s, elapsed: 17.13 s
2020-06-12T19:46:16.28758936Z stderr F  system usage: CPU: user: 0.13 s, system: 0.08 s, elapsed: 19.04 s
2020-06-12T19:47:34.264882626Z stderr F system usage: CPU: user: 0.20 s, system: 0.19 s, elapsed: 36.22 s
2020-06-12T19:49:15.383436343Z stderr F system usage: CPU: user: 0.28 s, system: 0.38 s, elapsed: 74.06 s
2020-06-12T19:53:47.229361981Z stderr F system usage: CPU: user: 0.66 s, system: 1.06 s, elapsed: 214.12 s
2020-06-12T20:19:39.619748109Z stderr F system usage: CPU: user: 4.13 s, system: 5.30 s, elapsed: 1461.16 s
2020-06-12T21:30:31.634637945Z stderr F system usage: CPU: user: 13.73 s, system: 19.59 s, elapsed: 4225.61 s
2020-06-12T23:54:32.511015886Z stderr F system usage: CPU: user: 33.57 s, system: 41.90 s, elapsed: 8514.23 s
2020-06-13T04:23:32.230960572Z stderr F system usage: CPU: user: 66.09 s, system: 82.95 s, elapsed: 16011.25 s
2020-06-13T12:00:37.43434175Z stderr F  system usage: CPU: user: 99.42 s, system: 130.48 s, elapsed: 27296.71 s
2020-06-14T02:40:25.19122979Z stderr F  system usage: CPU: user: 202.96 s, system: 263.66 s, elapsed: 52653.66 s
2020-06-15T01:55:09.03766272Z stderr F  system usage: CPU: user: 317.54 s, system: 544.48 s, elapsed: 83550.21 s

We have several instances of our cloud service, and each instance has its 
own database. On the other instances, the autovacuum elapsed time is 
consistently less than 25 seconds on every run.

Other than the increasing elapsed times for the autovacuum, we don't see 
any other indication in the logs of a problem (no error messages, etc).

We're currently using PostgreSQL version 10.10. Our service is JVM-based 
and we're using the PostgreSQL JDBC driver version 42.2.5.

Have we stumbled upon a potential bug here, or do we need to tweak some 
autovacuum settings?  What should we look at next or what should we try 
next to further troubleshoot this?

Regards,

Jim Hurne






RE: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-16 Thread Jim Hurne
Thanks Michael,

Here are our current autovacuum settings:

 autovacuum  | on
 autovacuum_analyze_scale_factor | 0.1
 autovacuum_analyze_threshold| 50
 autovacuum_freeze_max_age   | 2
 autovacuum_max_workers  | 3
 autovacuum_multixact_freeze_max_age | 4
 autovacuum_naptime  | 1min
 autovacuum_vacuum_cost_delay| 20ms
 autovacuum_vacuum_cost_limit| -1
 autovacuum_vacuum_scale_factor  | 0.2
 autovacuum_vacuum_threshold | 50
 autovacuum_work_mem | -1
 log_autovacuum_min_duration | 1s
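
(The values above were pulled from the running server; a query along these 
lines shows the same thing:)

  SELECT name, current_setting(name) AS setting
  FROM pg_settings
  WHERE name LIKE 'autovacuum%' OR name = 'log_autovacuum_min_duration'
  ORDER BY name;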

>  Is this a destructive queue of sorts with no rows permanently stored?

Essentially, yes. The system can create and delete objects at a high rate. 
Here are some recent stats on the pg_largeobject table:

# select * from pg_stat_all_tables where relname = 'pg_largeobject';
-[ RECORD 1 ]---+---
relid   | 2613
schemaname  | pg_catalog
relname | pg_largeobject
seq_scan| 0
seq_tup_read| 0
idx_scan| 66619475
idx_tup_fetch   | 126816721
n_tup_ins   | 57580140
n_tup_upd   | 188474
n_tup_del   | 57640395
n_tup_hot_upd   | 108845
n_live_tup  | 47771
n_dead_tup  | 57787135
n_mod_since_analyze | 115409009
last_vacuum |
last_autovacuum |
last_analyze|
last_autoanalyze|
vacuum_count| 0
autovacuum_count| 0
analyze_count   | 0
autoanalyze_count   | 0

# SELECT pg_size_pretty( pg_total_relation_size('pg_largeobject') );
-[ RECORD 1 ]--+---
pg_size_pretty | 350 GB 

The total size of the actual undeleted large objects is about 60 MB.
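
(For reference, that figure is an estimate of the live large object data, 
which can be obtained with something along the lines of:)

  -- requires read access to pg_largeobject
  SELECT pg_size_pretty(sum(octet_length(data)))
  FROM pg_largeobject;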

Regards,

Jim Hurne




From:   Michael Lewis 
To: Jim Hurne 
Cc: PostgreSQL General 
Date:   06/16/2020 01:06 PM
Subject:[EXTERNAL] Re: autovacuum failing on pg_largeobject and 
disk usage of the pg_largeobject growing unchecked



On Tue, Jun 16, 2020 at 10:01 AM Jim Hurne  wrote:
> Other than the increasing elapsed times for the autovacuum, we don't see 
> any other indication in the logs of a problem (no error messages, etc).
> 
> We're currently using PostgreSQL version 10.10. Our service is JVM-based 
> and we're using the PostgreSQL JDBC driver version 42.2.5.
> 
> Have we stumbled upon a potential bug here, or do we need to tweak some 
> autovacuum settings?  What should we look at next or what should we try 
> next to further troubleshoot this?


What are your current autovacuum settings? Do you have long 
running transactions with any frequency? Decreasing 
autovacuum_vacuum_cost_delay to 1 or 2ms may be prudent (default changes 
from 20ms down to 2ms with PG 12). Is this a destructive queue of sorts 
with no rows permanently stored? If so, I would expect that a daily 
scheduled vacuum analyze may be the best course of action.
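
(For concreteness, the kind of adjustment suggested here might look like 
this; a sketch, not commands taken from the thread, and the cost-delay 
change needs superuser access:)

  ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
  SELECT pg_reload_conf();

  -- and/or, as a scheduled daily maintenance job against the affected database:
  VACUUM (VERBOSE, ANALYZE) pg_largeobject;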







Re: Sv: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-17 Thread Jim Hurne
> In my experience vacuumlo, https://www.postgresql.org/docs/12/vacuumlo.html, 
> is needed to remove large objects, before vacuum can remove them from 
> pg_largeobject.

Thanks for the suggestion! vacuumlo deletes any orphaned large objects, 
presumably by using lo_unlink. While we don't use vacuumlo, our service 
does clean up orphaned large objects in a similar way. And indeed, 
inspecting the database reveals that there are no orphaned large objects.
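
(Conceptually, the cleanup looks for large object OIDs that are no longer 
referenced by any application row; the table and column names below are 
illustrative only, not our actual schema:)

  SELECT lo.oid
  FROM pg_largeobject_metadata lo
  WHERE NOT EXISTS (
      SELECT 1 FROM app_content c WHERE c.content_oid = lo.oid
  );
  -- each OID returned can then be removed with lo_unlink(oid)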

Regards,

Jim Hurne





RE: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-17 Thread Jim Hurne
Michael Lewis  wrote on 06/16/2020 04:41:16 PM:
> Still, if you run a manual vacuum analyze verbose, what sort of output 
> do you get?

On one of the instances that is exhibiting the "disk leak" behavior, the 
VACUUM ANALYZE VERBOSE command doesn't generate any output or complete 
before I lose the connection to the database (presumably because I hit a 
connection read timeout). Is it possible to configure the read timeout for 
psql?

On some of our healthy instances, we were able to run VACUUM ANALYZE 
VERBOSE:

=> vacuum verbose analyze pg_largeobject;
INFO:  vacuuming "pg_catalog.pg_largeobject"
INFO:  scanned index "pg_largeobject_loid_pn_index" to remove 630 row versions
DETAIL:  CPU: user: 0.06 s, system: 0.10 s, elapsed: 0.17 s
INFO:  "pg_largeobject": removed 630 row versions in 190 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  index "pg_largeobject_loid_pn_index" now contains 533 row versions in 18346 pages
DETAIL:  630 index row versions were removed.
18340 index pages have been deleted, 18339 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "pg_largeobject": found 577 removable, 533 nonremovable row versions in 399 out of 399 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 69015245
There were 550 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.06 s, system: 0.11 s, elapsed: 0.17 s.
INFO:  "pg_largeobject": truncated 399 to 305 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  analyzing "pg_catalog.pg_largeobject"
INFO:  "pg_largeobject": scanned 305 of 305 pages, containing 533 live rows and 0 dead rows; 533 rows in sample, 533 estimated total rows
VACUUM


> Are there indexes on this table that perhaps are very large and 
> needing to be rescanned many times because maintenance_work_mem
> isn't high enough to handle in a single pass?

On the same healthy instance where we were able to run the VACUUM ANALYZE 
VERBOSE, the index size does seem reasonably small:

=> SELECT pg_size_pretty (pg_indexes_size('pg_largeobject'));
-[ RECORD 1 ]--+---
pg_size_pretty | 143 MB

But on the unhealthy instance, it is much larger:

=> SELECT pg_size_pretty (pg_indexes_size('pg_largeobject'));
 pg_size_pretty

 7241 MB
(1 row)

But it isn't clear to me if the index size is a symptom or if it's the 
actual root cause.
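
(For what it's worth, per-index sizes can be broken out with something like 
the following; pg_largeobject only has the one index:)

  SELECT indexrelid::regclass AS index_name,
         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_index
  WHERE indrelid = 'pg_catalog.pg_largeobject'::regclass;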

> You might try "create index concurrently, drop index concurrently, 
> & rename index" (reindex concurrently if you were on 
> PG 12) as a sort of online 'vacuum full' on the index(es).

Unfortunately, since pg_largeobject is a system table, the user we use to 
connect to the database doesn't have permissions to do this. We get a "
must be owner of relation pg_largeobject" error when we try to create the 
replacement index (using CREATE INDEX CONCURRENTLY).
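
(For illustration, the statement we tried was roughly of this form; the 
column list matches the existing pg_largeobject_loid_pn_index, and the new 
index name is just a placeholder:)

  CREATE UNIQUE INDEX CONCURRENTLY pg_largeobject_loid_pn_index_new
      ON pg_catalog.pg_largeobject (loid, pageno);
  -- fails for us with: ERROR:  must be owner of relation pg_largeobject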

> By the way, the best practices for these mailing list suggest 
> partial quoting and responding in-line or below, not "top posting" 
> with the entire conversation below.

My mistake! Sorry about that.


Regards,

Jim Hurne






RE: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-22 Thread Jim Hurne
We're still struggling to figure out why autovacuum seems to be failing or 
is ineffective sometimes against the pg_largeobject table. 

We decided to try a VACUUM FULL VERBOSE on one of our problematic 
databases. It did complete, but it took a surprisingly long time:

  INFO:  vacuuming "pg_catalog.pg_largeobject"
  INFO:  "pg_largeobject": found 97326130 removable, 22230 nonremovable row versions in 207508868 pages
  DETAIL:  0 dead row versions cannot be removed yet.
  CPU: user: 1085.87 s, system: 3803.66 s, elapsed: 10883.94 s.
  VACUUM

So it took about 3 hours.  It's surprising because there was only 58 MB of 
undeleted data in about 65 large objects. When we ran this particular 
vacuum, we made sure there were no other processes connected to the DB and 
that no other statements were running that would have locked the table. 
It's my understanding that a VACUUM FULL works by creating a brand new 
table and copying the "non-deleted" data into the new table. If that's the 
case, I would have expected it to complete very quickly, given how little 
"non-deleted" data existed at the time.

We are of course going to continue to try different things, but does 
anyone have any other suggestions on what we should be looking at or what 
settings we might want to adjust?

For reference, here was my original problem description:

> We have a cloud service that uses PostgreSQL to temporarily store binary 
> content. We're using PostgreSQL's Large Objects to store the binary 
> content. Each large object lives anywhere from a few hundred milliseconds 
> to 5-10 minutes, after which it is deleted.
> 
> Normally, this works just fine and we have no issues. However, roughly 
> every 3 months or so, disk usage on our PostgreSQL database starts to 
> increase at an alarming rate. More specifically, the pg_largeobject 
> table's disk usage steadily increases even though we've deleted most of 
> the large objects. This will continue until the database server runs out 
> of disk space (and then all sorts of bad things happen of course).
> 
> It looks to us like autovacuum against the pg_largeobject table stops 
> working because the pg_stat_all_tables.last_autovacuum column stops 
> updating, or is updated infrequently, while it is still updated on other 
> tables.
>
> We're currently using PostgreSQL version 10.10. Our service is JVM-based 
> and we're using the PostgreSQL JDBC driver version 42.2.5.


Regards,

Jim Hurne





RE: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-23 Thread Jim Hurne
Magnus Hagander  wrote on 06/22/2020 04:44:33 PM:
> Yes, that's how VACUUM FULL works (more or less). 

Thanks for the confirmation and further explanation Magnus! I'm definitely 
getting a better idea of how the vacuum and autovacuum facilities work.


> And for autovacuum, with a cost_delay of 20ms and a cost_limit of 
> 200, autovacuum would spend about 55 hours just on vacuum delay for 
> the reads...
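
(If I follow the math: the table is roughly 200 million heap pages, each 
page read as a cache miss costs vacuum_cost_page_miss = 10, so about 2 
billion cost points in total; at cost_limit = 200 per 20 ms of delay that 
is roughly 10,000 points per second of sleep, or about 55 hours of delay 
for the reads alone, assuming mostly cache misses.)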

Ah, very interesting. That explains (at least in part) why the autovacuums 
seem to be taking a long time. We'll consider adjusting the cost_delay and 
cost_limit settings. Are there any downsides to setting a very low 
cost_delay (maybe even 0ms), or to setting a large cost_limit (such as 
100,000)?


> The logs you posted originally seem to be excluding the actual 
> autovacuum details -- can you include those?

Sure! Below are more of the details from the same set of logs. Looking at 
them myself, I see that there is always some percentage of tuples that are 
dead but are not yet removable. And that number increases on every vacuum, 
which might explain in part why autovacuum elapsed times keep increasing. 

What causes a dead tuple to be unremovable?

-- LOGS --
2020-06-12T19:41:58.335881879Z stderr F 2020-06-12 19:41:58 UTC [528141]: [2-1] user=,db=,client= LOG:  automatic vacuum of table "ibmclouddb.pg_catalog.pg_largeobject": index scans: 0
2020-06-12T19:41:58.335899718Z stderr F pages: 0 removed, 8649 remain, 0 skipped due to pins, 0 skipped frozen
2020-06-12T19:41:58.335908658Z stderr F tuples: 0 removed, 30445 remain, 17410 are dead but not yet removable, oldest xmin: 1942750826
2020-06-12T19:41:58.335917078Z stderr F buffer usage: 17444 hits, 0 misses, 0 dirtied
2020-06-12T19:41:58.335925217Z stderr F avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
2020-06-12T19:41:58.335931494Z stderr F system usage: CPU: user: 0.02 s, system: 0.00 s, elapsed: 1.77 s
2020-06-12T19:42:59.704833148Z stderr F 2020-06-12 19:42:59 UTC [528231]: [2-1] user=,db=,client= LOG:  automatic vacuum of table "ibmclouddb.pg_catalog.pg_largeobject": index scans: 1
2020-06-12T19:42:59.704852755Z stderr F pages: 0 removed, 14951 remain, 0 skipped due to pins, 0 skipped frozen
2020-06-12T19:42:59.704861911Z stderr F tuples: 2 removed, 53054 remain, 27127 are dead but not yet removable, oldest xmin: 1942751009
2020-06-12T19:42:59.704870553Z stderr F buffer usage: 30144 hits, 0 misses, 3 dirtied
2020-06-12T19:42:59.704878195Z stderr F avg read rate: 0.000 MB/s, avg write rate: 0.008 MB/s
2020-06-12T19:42:59.704884752Z stderr F system usage: CPU: user: 0.02 s, system: 0.02 s, elapsed: 3.06 s
2020-06-12T19:44:01.928225496Z stderr F 2020-06-12 19:44:01 UTC [528326]: [2-1] user=,db=,client= LOG:  automatic vacuum of table "ibmclouddb.pg_catalog.pg_largeobject": index scans: 0
2020-06-12T19:44:01.928265589Z stderr F pages: 0 removed, 22395 remain, 0 skipped due to pins, 0 skipped frozen
2020-06-12T19:44:01.928276366Z stderr F tuples: 0 removed, 76562 remain, 37235 are dead but not yet removable, oldest xmin: 1942751009
2020-06-12T19:44:01.9282856Z stderr F   buffer usage: 45131 hits, 0 misses, 433 dirtied
2020-06-12T19:44:01.928293976Z stderr F avg read rate: 0.000 MB/s, avg write rate: 0.621 MB/s
2020-06-12T19:44:01.928300512Z stderr F system usage: CPU: user: 0.06 s, system: 0.01 s, elapsed: 5.44 s
2020-06-12T19:45:14.124140716Z stderr F 2020-06-12 19:45:14 UTC [528433]: [2-1] user=,db=,client= LOG:  automatic vacuum of table "ibmclouddb.pg_catalog.pg_largeobject": index scans: 0
2020-06-12T19:45:14.124178864Z stderr F pages: 0 removed, 31029 remain, 0 skipped due to pins, 0 skipped frozen
2020-06-12T19:45:14.124187376Z stderr F tuples: 0 removed, 104441 remain, 65915 are dead but not yet removable, oldest xmin: 1942751009
2020-06-12T19:45:14.124196608Z stderr F buffer usage: 57692 hits, 4834 misses, 2095 dirtied
2020-06-12T19:45:14.124203046Z stderr F avg read rate: 2.204 MB/s, avg write rate: 0.955 MB/s
2020-06-12T19:45:14.124209167Z stderr F system usage: CPU: user: 0.11 s, system: 0.05 s, elapsed: 17.13 s
2020-06-12T19:46:16.287517553Z stderr F 2020-06-12 19:46:16 UTC [528529]: [2-1] user=,db=,client= LOG:  automatic vacuum of table "ibmclouddb.pg_catalog.pg_largeobject": index scans: 0
2020-06-12T19:46:16.287558376Z stderr F pages: 0 removed, 36257 remain, 0 skipped due to pins, 0 skipped frozen
2020-06-12T19:46:16.287567454Z stderr F tuples: 0 removed, 122273 remain, 81080 are dead but not yet removable, oldest xmin: 1942751009
2020-06-12T19:46:16.287575752Z stderr F buffer usage: 62700 hits, 10366 misses, 76 dirtied
2020-06-12T19:46:16.287582866Z stderr F avg read rate: 4.252 MB/s, avg write rate: 0.031 MB/s
2020-06-12T19:46:16.28758936Z stderr F  system usage: CPU: user: 0.13 s, system: 0.08 s, elapsed: 19.04 s

RE: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-24 Thread Jim Hurne
"Daniel Verite"  wrote on 06/22/2020 06:00:37 PM:
> If you can arrange a maintenance window, a faster way to rebuild
> pg_largeobject when it contains mostly empty pages can be to:

Thanks Daniel. If we cannot arrange for a maintenance window, is this the 
sort of thing that pg_repack[1] could be used to do?

Would pg_repack require setting allow_system_table_mods to on?

[1] https://github.com/reorg/pg_repack

Regards,

Jim Hurne





RE: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-24 Thread Jim Hurne
"Daniel Verite"  wrote on 06/24/2020 10:08:27 AM:
> Unfortunately it [pg_repack] can't work with pg_largeobject

That is unfortunate, and potentially leaves us in a difficult spot.

Is it possible to configure PostgreSQL to use a user table for large 
objects instead of a system table? 

We're finding it especially difficult to maintain the pg_largeobject 
system table (tweak autovacuum settings, manually clean up with something 
like pg_repack, etc.), particularly since our PostgreSQL instances are 
hosted and controlled by another team.
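
(What we have in mind is an ordinary table with a bytea column, which we 
could tune, vacuum, and repack like any other user table. A rough sketch, 
with illustrative names:)

  CREATE TABLE binary_content (
      id         bigserial PRIMARY KEY,
      created_at timestamptz NOT NULL DEFAULT now(),
      content    bytea NOT NULL
  );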

Regards,

Jim Hurne





Re: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-24 Thread Jim Hurne
Michael Lewis  wrote on 06/23/2020 04:44:51 PM:
> Long running transactions are the common one that I see. You might 
> be dealing with replication slots or prepared transactions. 
> Basically, if some process might see that "old truth", then it can't
> be vacuumed away yet.

Thanks, those links you provided were informative.

Our application doesn't use prepared transactions, so that one is easy to 
eliminate.

Our database does use replication, but there's only one replication slot 
and the xmin and catalog_xmin columns are blank. I presume the only 
replication slot that exists is the one that the other replica is in fact 
using. I *think* this means replication isn't the issue, but it's hard to 
say since the xmin column is blank (and all the reading I've done doesn't 
even mention that the xmin on replication slots can be null).
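
(Checked with something along the lines of:)

  SELECT slot_name, slot_type, active, xmin, catalog_xmin
  FROM pg_replication_slots;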

That brings us to transactions. I looked for any long-running idle 
transaction (transactions that have been idle for more than 15 or 5 
minutes), but found none.

I tried:

SELECT pid, datname, usename, state, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

But the backend_xmin for all of the rows returned is exactly the same, and 
that xmin is greater than the oldest xmin reported in the autovacuum logs. 
It does seem odd that the backend_xmin value isn't changing though. Is 
that normal?

So, for replication slots, I'm seeing a null xmin value, which the 
articles do not comment on how that should be interpreted. And for 
transactions, all of the transaction xmins are the same, which also seems 
odd and not what the articles suggested. I know the transactions 
themselves are coming and going because I can see the pids changing, but 
the xmins are always the same. Strange.

Having to check the logs to find the oldest xmin is painful, and it 
requires that a vacuum or autovacuum has already been executed. Is there 
another way to check what the oldest xmin is on the dead tuples of a table?
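
(The usual xmin holders can at least be checked one by one; a sketch:)

  -- open transactions / snapshots
  SELECT pid, state, age(backend_xmin) AS xmin_age
  FROM pg_stat_activity
  WHERE backend_xmin IS NOT NULL
  ORDER BY age(backend_xmin) DESC;

  -- prepared transactions
  SELECT gid, prepared, age(transaction) AS xmin_age
  FROM pg_prepared_xacts;

  -- replication slots
  SELECT slot_name, age(xmin) AS xmin_age, age(catalog_xmin) AS catalog_xmin_age
  FROM pg_replication_slots;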

Regards,

Jim Hurne





RE: autovacuum failing on pg_largeobject and disk usage of the pg_largeobject growing unchecked

2020-06-26 Thread Jim Hurne
"Jim Hurne"  wrote on 06/24/2020 03:44:26 PM:
> That brings us to transactions. I looked for any long-running idle 
> transaction (transactions that have been idle for more than 15 or 5 
> minutes), but found none.

It turns out I didn't see any long running transactions because I was 
using a user who didn't have permissions to view most of the data in the 
pg_stat_activity view.  When we looked using a more powerful user, we 
found four transactions that had been idle for a long time. Killing those 
transactions allowed vacuums to clean up the dead tuples.
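
(Roughly the kind of check and cleanup involved, run as a sufficiently 
privileged user; the 15-minute threshold is illustrative:)

  SELECT pid, usename, state, xact_start, now() - xact_start AS xact_age
  FROM pg_stat_activity
  WHERE state LIKE 'idle in transaction%'
    AND xact_start < now() - interval '15 minutes'
  ORDER BY xact_start;

  -- after confirming the sessions are safe to kill:
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state LIKE 'idle in transaction%'
    AND xact_start < now() - interval '15 minutes';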

This is most likely the root of all of our problems, so we can essentially 
consider this mystery solved for now.

Thanks to all who gave us advice and information. It's been highly 
educational. I'm just sorry it turned out to be a stupid "long running 
transaction" problem. We'll know to look for that first in the future.