parallel pg_restore blocks on heavy random read I/O on all child processes

2025-03-20 Thread Dimitrios Apostolou

Hello list,

I noticed some weird behaviour: doing a pg_restore of a huge database
dump leads to constant read I/O (at about 15K IOPS from the NVMe drive
that holds the dump file) for about one hour. I believe it happens with
any -j value >= 2.
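
(One way to watch that read rate, in case someone wants to reproduce this,
is the extended iostat output from sysstat, e.g. as below; the device name
is just a placeholder for whichever disk holds the dump file.)

# per-device extended stats every second; the r/s column is the read IOPS
iostat -x nvme0n1 1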

In particular, I get output like the following in the pg_restore log, only
a few seconds after running it:

pg_restore: launching item 12110 TABLE DATA yyy
pg_restore: processing data for table "public.yyy"
[ long pause ...]
pg_restore: finished item 12110 TABLE DATA yyy

The output varies a bit, depending on when each process hits the
bottleneck, and it is hard to isolate each process's log lines since the
output is not prefixed with the PID, so take it with a grain of salt.

If I strace the pg_restore children, I see them read()ing 4 KB and
lseek()ing a couple of blocks forward, over and over:

read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET) = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET) = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET) = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET) = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET) = 55544426496

No other syscalls are occurring (no sends or writes whatsoever).
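
(The workers can be watched with something along these lines; the exact
flags are illustrative, attaching to every pg_restore process and filtering
for the file I/O calls:)

# attach to all pg_restore processes, follow forks, show only read/lseek
strace -f -e trace=read,lseek -p "$(pidof pg_restore)"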

After the ~1h pause, I don't notice the behaviour again during the rest of
the many-hour restore, even though there are plenty of larger tables in the
dump. The normal restore activity resumes, during which the pg_restore
processes are barely noticeable on the system, and the COPY postgres
backend processes do most of the work.

Here is the exact pg_restore command that last reproduced the issue for me,
on PostgreSQL 17:

pg_restore  -vvv -U $DBUSER -j8 -d $DBNAME --data-only --no-tablespaces 
--no-table-access-method --strict-names  $DUMP_FILENAME

And here is the pg_dump command that created the dump file, executed on
PostgreSQL 16:

pg_dump -v --format=custom --compress=zstd --no-toast-compression  $DBNAME   |  
$send_to_remote_storage

The database is multiple TBs in size, with a couple of thousand tables
(notably one huge table with 1000 partitions). The zstd-compressed
custom-format dump file is 816 GB in size.

What do you think causes this? Is it something that can be improved?

Thanks in advance,
Dimitris




Re: parallel pg_restore blocks on heavy random read I/O on all child processes

2025-03-20 Thread Tom Lane
Dimitrios Apostolou  writes:
> I noticed some weird behaviour: doing a pg_restore of a huge database
> dump leads to constant read I/O (at about 15K IOPS from the NVMe drive
> that holds the dump file) for about one hour. I believe it happens with
> any -j value >= 2.

> In particular, I get output like the following in the pg_restore log, only
> a few seconds after running it:

> pg_restore: launching item 12110 TABLE DATA yyy
> pg_restore: processing data for table "public.yyy"
> [ long pause ...]
> pg_restore: finished item 12110 TABLE DATA yyy

I am betting that the problem is that the dump's TOC (table of
contents) lacks offsets to the actual data of the database objects,
and thus the readers have to reconstruct that information by scanning
the dump file.  Normally, pg_dump will back-fill offset data in the
TOC at completion of the dump, but if it's told to write to an
un-seekable output file then it cannot do that.

> And here is the pg_dump command that created the dump file, executed on
> PostgreSQL 16:

> pg_dump -v --format=custom --compress=zstd --no-toast-compression  $DBNAME   
> |  $send_to_remote_storage

Yup, writing the output to a pipe would cause that ...

> What do you think causes this? Is it something that can be improved?

I don't see an easy way, and certainly no way that wouldn't involve
redefining the archive format.  Can you write the dump to a local
file rather than piping it immediately?
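
Untested sketch of what that could look like ($LOCAL_DUMP standing in for
wherever you have scratch space):

# write to a seekable local file so pg_dump can back-fill the TOC offsets,
# then ship the finished dump to remote storage afterwards
pg_dump -v --format=custom --compress=zstd --no-toast-compression \
    --file="$LOCAL_DUMP" $DBNAME
$send_to_remote_storage < "$LOCAL_DUMP"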

regards, tom lane