How to avoid UPDATE performance degradation in a transaction

2020-02-13 Thread Karl Düüna
Hi

I recently came across a performance problem with a big transaction block,
which doesn't make sense to me and hopefully someone more knowledgeable can
explain the reasons and point out a direction for a solution.

-- TL; DR;

UPDATE on a row takes relatively constant amount of time outside a
transaction block, but running UPDATE on a single row over and over inside
a transaction gets slower and slower as the number of UPDATE operations
increases.

Why is updating the same row large number of times progressively slower
inside a transaction? And is there a way to avoid this performance
degradation?

I set up a POC repository to demonstrate the problem:
https://github.com/DeadAlready/pg-test

-- Backstory

Needed to run a large block of operations (a mix of inserts and updates) on
a table. It took a considerable amount of time inside a transaction and was
about 10x faster without the transaction. Since I need all the operations
to run as a single block that can be rolled back this was unsatisfactory.
Thus began my quest to locate the problem. Since the actual data structure
is complex and involves a bunch of triggers, foreign keys etc it took some
time to narrow down, but in the end I found that the structure itself is
irrelevant. The issue occurs even if you have a single two column table
with a handful of rows. The only requirement seems to be that the NR of
UPDATEs per single row is large. While the update performance inside a
transaction starts out faster than outside, the performance starts to
degrade from the get go. It really isn't noticeable until about 5k UPDATEs
on a single row. At around 100k UPDATEs it is about 2.5x slower than the
same operation outside the transaction block and about 4x slower than at
the beginning of the transaction.

Thanks,
Karl


Re: How to avoid UPDATE performance degradation in a transaction

2020-02-13 Thread David G. Johnston
On Thu, Feb 13, 2020 at 1:42 PM Karl Düüna  wrote:

> It really isn't noticeable until about 5k UPDATEs on a single row.
>

Don't know why, and never dealt with a scenario where this would even come
up, but that this doesn't perform well inside a transaction isn't
surprising to me.  Kinda surprised it works well at all actually.  I'd
probably try and rework the processing algorithm to create an unlogged
temporary table with data from the row's initial state, manipulate until my
heart's content, then take the final result and update the single live row
with the final state.

David J.


Re: How to avoid UPDATE performance degradation in a transaction

2020-02-13 Thread Tom Lane
=?UTF-8?B?S2FybCBEw7zDvG5h?=  writes:
> -- TL; DR;
> UPDATE on a row takes relatively constant amount of time outside a
> transaction block, but running UPDATE on a single row over and over inside
> a transaction gets slower and slower as the number of UPDATE operations
> increases.

Yeah, that's unsurprising.  Each new update creates a new version of
its row.  When you do them in separate transactions, then as soon as
transaction N+1 commits the system can recognize that the row version
created by transaction N is dead (no longer visible to anybody) and
recycle it, allowing the number of row versions present on-disk to
stay more or less constant.  However, there's not equivalently good
housekeeping for row versions created by a transaction that's still
running.  So when you do N updates in one transaction, there are going
to be N doomed-but-not-yet-recyclable row versions on disk.

Aside from the disk-space bloat, this is bad because the later updates
have to scan through all the row versions created by earlier updates,
looking for the version they're supposed to update.  So you have an O(N^2)
cost associated with that, which no doubt is what you're observing.

There isn't any really good fix for this, other than "don't do that".
David's nearby suggestion of using a temp table won't help, because
this behavior is the same whether the table is temp or regular.

In principle perhaps we could improve the granularity of dead-row
detection, so that if a row version is both created and deleted by
the current transaction, and we have no live snapshots that could
see it, we could go ahead and mark the row dead.  But it's not clear
that that'd be worth the extra cost to do.  Certainly no existing PG
release tries to do it.

regards, tom lane