Orphaned relations after crash/sigkill during CREATE TABLE

2020-08-18 Thread Jason Myers
Postgres 12.4

I was directed in slack to mention here that we're being impacted by
Postgres leaving orphaned pages in /base/ after a crash while a CREATE
TABLE is being run in transaction.

The issue is the same as the reproduction steps listed here [1], that is:

- Start a CREATE TABLE transaction for a large table
- Terminate the process via kill -9
- Pages are left in /base that have no filenode references anymore, such
that `pg_database_size()` grows to be very large while total table+index
size remains constant

However in our particular case, we're using a managed/cloud Postgres server
and our `CREATE TABLE` transaction was being terminated by the OOM killer.
Using a managed service, we don't have filesystem access to go and clear
out these orphaned pages.  This caused our total db size to grow from 40GB
of table+index data to 4TB on-disk (but still only 40GB of table+index
data, the other ~3.95TB being orphaned CREATE TABLE pages)

I realize (per a blog post from Robert Haas [2] and from slack
conversation) that this is a known issue, but was directed here from slack
to just mention that we were impacted by it, and have no resolution due to
not having filesystem access, and not having a method internally to
Postgres to deal with these orphaned relations.  (Our recourse currently is
to do something like a pg_dump/pg_restore onto a clean instance in order to
escape the orphaned files)

-Jason

[1] https://github.com/bdrouvot/pg_orphaned#example-1
[2]
https://rhaas.blogspot.com/2020/05/dont-manually-modify-postgresql-data.html


Re: Orphaned relations after crash/sigkill during CREATE TABLE

2020-08-18 Thread Jason Myers
> On Tue, Aug 18, 2020 at 3:49 PM Adrian Klaver 
wrote:

> > Have you tried with:
> BEGIN;
> CREATE TABLE some_table SELECT some_data FROM other_table LIMIT 1 WITH
> NO DATA;
> COMMIT;
>
> The above gets you the table structure, but no data.
>
> BEGIN;
> INSERT into some_table SELECT * FROM other_table;
>COMMIT;
>
> The above populates the table

Thanks -- we were indeed creating and populating the new table all in a
single transaction.

I'll see if we can split this into two transactions so that the table
structure is committed quickly.  I think you're right that this would
mostly sidestep the issue.

-Jason

p.s. Apologies if this is formatted wrong, this is my first mailing list
post.


Re: Orphaned relations after crash/sigkill during CREATE TABLE

2020-08-20 Thread Jason Myers
On Tue, Aug 18, 2020 at 3:49 PM Adrian Klaver 
wrote:
> So from [1] you are using CREATE TABLE AS. Have you tried with:
>
> BEGIN;
> CREATE TABLE some_table SELECT some_data FROM other_table LIMIT 1 WITH
> NO DATA;
> COMMIT;
>
> The above gets you the table structure, but no data.
>
> BEGIN;
> INSERT into some_table SELECT * FROM other_table;
> COMMIT;
>
> The above populates the table. Have not tested but I'm going to assume
> if you kill the above the problem would not happen or would be fixable
> by DELETE FROM some_table/TRUNCATE some_table;

I was able to implement this, which creates the table quickly in a first
transaction and populates it in a second transaction.

However we were still seeing orphaned files on crash, and I believe I
tracked it down to subsequent CREATE INDEX statements also creating these
orphaned files (if they are running during a crash).

Is that issue known as well?  I don't believe I can use the same trick to
sidestep that one...

-Jason


Re: Orphaned relations after crash/sigkill during CREATE TABLE

2020-08-20 Thread Jason Myers
On Thu, Aug 20, 2020 at 5:46 PM Tom Lane  wrote:

> Jason Myers  writes:
> > However we were still seeing orphaned files on crash, and I believe I
> > tracked it down to subsequent CREATE INDEX statements also creating these
> > orphaned files (if they are running during a crash).
> > Is that issue known as well?  I don't believe I can use the same trick to
> > sidestep that one...
>
> Yeah, it's entirely intentional that we don't try to clean up orphaned
> disk files after a database crash.  There's a long discussion of this and
> related topics in src/backend/access/transam/README.  What that says about
> why not is that such files' contents might be useful for forensic analysis
> of the crash, and anyway "Orphan files are harmless --- at worst they
> waste a bit of disk space".  A point not made in that text, but true
> anyway, is that it'd also be quite expensive to search a large database
> for orphaned files, so people would likely not want to pay that price
> on the way to getting their database back up.
>
> There might be value in a user-invokable tool that runs in an existing
> non-crashed database and looks for orphan files, but I'm not aware that
> anyone has written one.  (Race conditions against concurrent table
> creation would be a problem; but probably that can be finessed somehow,
> maybe by noting the file's creation time.)
>
> In the meantime I've got to say that routinely kill 9'ing database
> processes just doesn't seem like a very good idea.  Yeah, we do our best
> to ensure that there won't be data loss, but you're really doubling down
> on a hard assumption that Postgres contains zero bugs when you operate
> that way.  I'd suggest reconfiguring things to avoid the OOM kill hazard;
> or if your cloud provider makes that effectively impossible, maybe you
> need another provider.  But on most systems I'd think you could use ulimit
> or the like even if you don't have root privileges.
>
> regards, tom lane
>

Understood, thanks for the reply.

-Jason