Re: Bulk DML performance
Hello,
Regarding the additional time for UPDATE, you can try the following:
CREATE TABLE test3 (
id bigint PRIMARY KEY,
text1 text
) WITH (fillfactor=30);
See: https://www.postgresql.org/docs/17/storage-hot.html
My local test gives me almost the same time for the INSERT (first run) and
the UPDATEs (subsequent upserts).
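The lower fillfactor leaves free space in each heap page so that the updates
can be HOT (heap-only tuples) and skip index maintenance. A quick way to
verify this, assuming the test3 table above and the standard statistics view,
is to compare total and HOT update counts after an upsert run:

-- n_tup_hot_upd close to n_tup_upd means most updates avoided index maintenance.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'test3';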
Regarding the overall problem, there is always room for improvement. I did
a quick test with partitions, and I found out that Postgres will not
parallelize the upserts for us. One solution could be to partition the
records at the application level, creating one connection per partition. On
the DB side, the partitions can be implemented as standard tables (using a
union view on top of them) or actual partitions of a main table. However,
this solution does not strictly respect the "one
single transaction" constraint...
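For illustration, a minimal sketch of the "actual partitions of a main table"
variant using hash partitioning (the name test_part and the modulus of 4 are
assumptions); each application connection would then upsert into its own
partition:

CREATE TABLE test_part (
id bigint PRIMARY KEY,
text1 text
) PARTITION BY HASH (id);

-- One partition per application-level connection.
CREATE TABLE test_part_0 PARTITION OF test_part FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE test_part_1 PARTITION OF test_part FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE test_part_2 PARTITION OF test_part FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE test_part_3 PARTITION OF test_part FOR VALUES WITH (MODULUS 4, REMAINDER 3);

The upsert can still target the parent table (ON CONFLICT (id) works here
because id is the partition key), or each connection can write to its
partition directly.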
Regards,
Renan Fonseca
On Thu, Mar 13, 2025 at 08:40, wrote:
> Hello! I’m building a system that needs to insert/update batches of
> millions of rows (using INSERT .. ON CONFLICT (…) DO UPDATE) in a single
> database transaction, where each row is about 1.5 kB. The system produces
> about 3 million rows (about 4.5 GB) of data in about 5 seconds, but
> PostgreSQL takes about 35 seconds to insert that data and about 55 seconds
> to update that data. This is both on my local dev machine as well as on a
> large AWS Aurora PostgreSQL instance (db.r8g.16xlarge with 64 vCPUs, 512 GB
> RAM and 30 Gbps).
>
>
>
> The following INSERT .. ON CONFLICT (…) DO UPDATE statement
> inserts/updates 3 million rows with only 9 bytes per row and takes about 8
> seconds on first run (to insert the rows) and about 14 seconds on
> subsequent runs (to update the rows), but is only inserting 27 MB of data
> (3 million rows with 9 bytes per row); although after the first run, SELECT
> pg_size_pretty(pg_total_relation_size('test')) reports the table size as
> 191 MB and after the second run reports the table size as 382 MB (adding
> another 191 MB).
>
>
>
> CREATE TABLE test (
>
> id bigint PRIMARY KEY,
>
> text1 text
>
> );
>
>
>
> INSERT INTO test (id, text1)
>
> SELECT generate_series, 'x'
>
> FROM generate_series(1, 3000000)
>
> ON CONFLICT (id) DO UPDATE
>
> SET text1 = 'x';
>
>
>
> If PostgreSQL is writing 191 MB on the first run and 382 MB on each
> subsequent run, then PostgreSQL is only writing about 28 MB/s. Although
> PostgreSQL is also able to write about 4.5 GB in about 35 seconds (as
> stated above), which is about 128 MB/s, so it seems the performance
> constraint depends on the number of rows inserted more than the size of
> each row.
>
>
>
> Furthermore, deleting the rows takes about 18 seconds to perform (about 4
> seconds longer than the time taken to update the rows):
>
>
>
> DELETE FROM test
>
> WHERE id in (
>
> SELECT * FROM generate_series(1, 3000000)
>
> )
>
>
>
> It seems like it should be possible to do better than this on modern
> hardware, but I don’t have enough knowledge of the inner workings of
> PostgreSQL to know whether my instinct is correct on this, so I thought I’d
> raise the question with the experts.
>
>
>
> Thanks!
>
> Bill
>
Re: Bulk DML performance
On Thu, 2025-03-13 at 12:05 +0800, [email protected] wrote:
> The following INSERT .. ON CONFLICT (…) DO UPDATE statement inserts/updates
> 3 million rows with only 9 bytes per row and takes about 8 seconds on first
> run (to insert the rows) and about 14 seconds on subsequent runs (to update
> the rows),

I can confirm these times on my not so new laptop with local NVME.
That's the time it takes if you have an index on the table and want to be
crash safe.

> but is only inserting 27 MB of data (3 million rows with 9 bytes
> per row); although after the first run,
> SELECT pg_size_pretty(pg_total_relation_size('test')) reports the table size
> as 191 MB and after the second run reports the table size as 382 MB (adding
> another 191 MB).

That is unavoidable, because PostgreSQL adds a new version for each row to
the table. To avoid that kind of bloat, you'd have to update in smaller
batches and run VACUUM between those to free the "dead" row versions.

> CREATE TABLE test (
> id bigint PRIMARY KEY,
> text1 text
> );
>
> INSERT INTO test (id, text1)
> SELECT generate_series, 'x'
> FROM generate_series(1, 3000000)
> ON CONFLICT (id) DO UPDATE
> SET text1 = 'x';
>
> If PostgreSQL is writing 191 MB on the first run and 382 MB on each subsequent
> run, then PostgreSQL is only writing about 28 MB/s. Although PostgreSQL is
> also able to write about 4.5 GB in about 35 seconds (as stated above), which
> is about 128 MB/s, so it seems the performance constraint depends on the
> number of rows inserted more than the size of each row.

It is the index maintenance that is killing you. Without the primary key,
the first insert takes under 1.8 seconds here. But you need the primary key
index if you want to use INSERT ... ON CONFLICT.

The update has to do even more work, so it is slower.

If you don't need crash safety, you could use UNLOGGED tables and be
somewhat faster (5.8 seconds for the initial INSERT here).

Essentially, the moderate performance is the price you are paying for data
integrity (crash safety) and consistency (primary key).

> Furthermore, deleting the rows takes about 18 seconds to perform (about 4
> seconds longer than the time taken to update the rows):
>
> DELETE FROM test
> WHERE id in (
> SELECT * FROM generate_series(1, 3000000)
> )

Well, that is not a great statement. The following takes only 1.6 seconds here:

DELETE FROM test WHERE id BETWEEN 1 AND 3000000;

And if you want to delete all rows, TRUNCATE is very, very fast.

> It seems like it should be possible to do better than this on modern
> hardware, but I don’t have enough knowledge of the inner workings of
> PostgreSQL to know whether my instinct is correct on this, so I thought I’d
> raise the question with the experts.

With the table as it is you won't get better performance if you want the
features that a relational database provides.

To get better performance, the best I can think of is to parallelize loading
the data until you saturate CPU, disk or hit internal contention in the
database.

Yours,
Laurenz Albe
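A minimal sketch of the UNLOGGED variant mentioned above (the name
test_unlogged is an assumption); skipping WAL buys speed, but the table is
emptied after a crash:

CREATE UNLOGGED TABLE test_unlogged (
id bigint PRIMARY KEY,
text1 text
);

-- Same upsert as before, now without WAL writes for the heap and the index.
INSERT INTO test_unlogged (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000)
ON CONFLICT (id) DO UPDATE
SET text1 = 'x';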
Bulk DML performance
Hello! I'm building a system that needs to insert/update batches of millions
of rows (using INSERT .. ON CONFLICT (…) DO UPDATE) in a single database
transaction, where each row is about 1.5 kB. The system produces about 3
million rows (about 4.5 GB) of data in about 5 seconds, but PostgreSQL takes
about 35 seconds to insert that data and about 55 seconds to update that
data. This is both on my local dev machine as well as on a large AWS Aurora
PostgreSQL instance (db.r8g.16xlarge with 64 vCPUs, 512 GB RAM and 30 Gbps).
The following INSERT .. ON CONFLICT (…) DO UPDATE statement inserts/updates
3 million rows with only 9 bytes per row and takes about 8 seconds on first
run (to insert the rows) and about 14 seconds on subsequent runs (to update
the rows), but is only inserting 27 MB of data (3 million rows with 9 bytes
per row); although after the first run, SELECT
pg_size_pretty(pg_total_relation_size('test')) reports the table size as 191
MB and after the second run reports the table size as 382 MB (adding another
191 MB).
CREATE TABLE test (
id bigint PRIMARY KEY,
text1 text
);
INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000)
ON CONFLICT (id) DO UPDATE
SET text1 = 'x';
If PostgreSQL is writing 191 MB on the first run and 382 MB on each
subsequent run, then PostgreSQL is only writing about 28 MB/s. Although
PostgreSQL is also able to write about 4.5 GB in about 35 seconds (as stated
above), which is about 128 MB/s, so it seems the performance constraint
depends on the number of rows inserted more than the size of each row.
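One way to see how much of the 191 MB is per-row heap overhead versus the
primary-key index (a sketch using the standard size functions on the test
table above):

SELECT pg_size_pretty(pg_relation_size('test'))       AS heap_size,
       pg_size_pretty(pg_indexes_size('test'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('test')) AS total_size;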
Furthermore, deleting the rows takes about 18 seconds to perform (about 4
seconds longer than the time taken to update the rows):
DELETE FROM test
WHERE id in (
SELECT * FROM generate_series(1, 3000000)
)
It seems like it should be possible to do better than this on modern
hardware, but I don't have enough knowledge of the inner workings of
PostgreSQL to know whether my instinct is correct on this, so I thought I'd
raise the question with the experts.
Thanks!
Bill
Re: Bulk DML performance
On Thu, 2025-03-13 at 18:13 +0800, [email protected] wrote:
> it is noteworthy that inserting 27 MB of data into a newly created table
> creates 191 MB of data including the index and 127 MB of data excluding
> the index.

PostgreSQL has a lot of overhead per row.

> Can you help me understand why performing 3 million lookups on a b-tree index
> with all pages cached in memory takes so long?

It is probably not the lookup, but the *modification* of the index that is slow.

> It seems like deleting 3 million rows identified by 3 million IDs should be
> faster than updating 3 million rows (also identified by 3 million IDs).

It should be, yes. To find out where the time is spent, use
EXPLAIN (ANALYZE, BUFFERS) on the statement.

> > To get better performance, the best I can think of is to parallelize loading
> > the data until you saturate CPU, disk or hit internal contention in the
> > database.
>
> Sadly, I cannot do that because I need all rows to be inserted in a single
> database transaction, which I cannot do over multiple database connections.

Then the best you can do is to use COPY rather than INSERT.
It will perform better (but now vastly better).

Yours,
Laurenz Albe
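A sketch of the suggested instrumentation, applied to the slow DELETE from
the original post; the BUFFERS output shows whether the time goes into the
heap, the index, or reading pages:

-- EXPLAIN ANALYZE executes the statement, so wrap it in a transaction and
-- roll back if the rows should be kept.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM test
WHERE id in (
SELECT * FROM generate_series(1, 3000000)
);
ROLLBACK;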
Re: Bulk DML performance
On Thu, 2025-03-13 at 12:28 +0100, I wrote:
> Then the best you can do is to use COPY rather than INSERT.
> It will perform better (but now vastly better).

Sorry, I meant "*not* vastly better".

Yours,
Laurenz Albe
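For completeness, a minimal sketch of the COPY route for the initial load
(the data source is an assumption; COPY has no ON CONFLICT handling, so it
only helps the insert case, not the upserts):

-- In psql, the rows follow inline and end with a line containing only \.
-- (a real load would stream all 3 million rows from the application).
BEGIN;
COPY test (id, text1) FROM STDIN WITH (FORMAT csv);
1,x
2,x
3,x
\.
COMMIT;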
