RE: Slow WAL recovery for DROP TABLE

2018-07-17 Thread Jamison, Kirk
Hi, I have also reported a similar problem on the hackers mailing list, though
specifically about TRUNCATE TABLE.
https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C62FD6E6%40g01jpexmbkw24

The problem lies in the standby server's replay: for each DROP/TRUNCATE TABLE it
performs a separate scan of the whole shared buffer pool to check whether the
table being deleted is cached in shared buffers. Recovery therefore takes a long
time and can sometimes fail for large tables, depending on the shared_buffers
size.
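
If I read the code correctly, the replay path ends up in
DropRelFileNodeBuffers() / DropRelFileNodesAllBuffers() in
src/backend/storage/buffer/bufmgr.c, which walk every buffer header. A
simplified sketch of the pattern (not the exact code; buffer_matches_relation()
below is a hypothetical stand-in for the real buffer-tag checks done under the
buffer header lock):

    /* Simplified sketch only -- paraphrasing bufmgr.c, not the actual code. */
    void
    DropRelFileNodeBuffers_sketch(RelFileNode rnode)
    {
        int         i;

        /*
         * One full pass over shared_buffers per invalidation request,
         * i.e. O(NBuffers) work even if the dropped table is tiny.
         */
        for (i = 0; i < NBuffers; i++)
        {
            BufferDesc *bufHdr = GetBufferDescriptor(i);

            /* hypothetical helper standing in for the real tag comparison */
            if (buffer_matches_relation(bufHdr, rnode))
                InvalidateBuffer(bufHdr);
        }
    }

So replaying a long series of DROP TABLE records means one full pass over
shared_buffers per record, which is why recovery time grows with the size of
shared_buffers rather than with the amount of WAL.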

The main problem here is the scanning of shared_buffers, which affects not only
DROP/TRUNCATE TABLE but also DROP DATABASE and VACUUM. Other developers have
contributed ideas and have been thinking about a buffer-mapping implementation
for PG12 that would solve all of these, but it is still a long way off.
In the meantime, I think any working smaller solutions/fixes from developers are
also welcome, such as the recently committed patch for multiple dropped tables
per transaction with large shared_buffers.

Regards,
Kirk Jamison

From: Sherrylyn Branchaw [mailto:sbranc...@gmail.com]
Sent: Tuesday, July 17, 2018 11:41 PM
To: pgsql-gene...@postgresql.org
Subject: Slow WAL recovery for DROP TABLE

We are running Postgres 9.6.9 on RHEL 6.9. We're using built-in streaming 
replication, with a WAL archive for fallback purposes. No logical replication.

A bug in our code recently created several hundred thousand tables in a single
database by accident. A batch job started cleaning them up one at a time (in
separate transactions) over the course of a couple of nights.

What we noticed the next day was that the two standbys for this database were
falling further and further behind. They were no longer streaming; instead they
were recovering from the archive, and doing so very slowly.

Only a small amount of WAL was generated by the DROP TABLEs relative to normal
traffic, but recovering a single WAL file sometimes took over a minute, compared
to about 1 second normally. Meanwhile, new WALs were being generated by regular
user traffic, and the standbys fell 24 hours behind, with the gap still growing,
before we resynced them so that we would not have to replay the WALs. That did
the trick, but the performance drop seemed drastic enough to be worth reporting.

I was able to reproduce the slow recovery time of WALs generated by DROP TABLE 
on a pair of test servers.

Normal writes (INSERT, UPDATE, DELETE, CREATE TABLE) generated WAL files that 
were recoverable on the order of 1 second each. The standby never fell behind, 
even when hundreds of WAL files were generated quickly. But when I generated 
WALs exclusively from DROP TABLE, it took 20-80 seconds to recover each one.

Some benchmarks:

For 10 minutes, I issued a bunch of rapid-fire INSERT statements to server1, 
the primary. These generated 144 WAL files. At the end of the 10 minutes, 
server2, the standby, had already recovered 143 of them and immediately 
completed the last one.

For 10 minutes, I issued a bunch of rapid-fire DROP TABLE statements to 
server1, the primary. These generated 28 WAL files. At the end of the 10 
minutes, server2, the standby, had only recovered 16 of them.

My expectation is that replaying binary changes on a standby should be at least
as fast as, and sometimes faster than, generating them locally. When a logical
command like "DROP TABLE" or "INSERT" is issued on a primary, the database has
to check for dependencies, satisfy constraints, etc. But once all those
decisions have been made on the primary, the binary changes should be replayable
much faster on the standby.

For example, server1, the primary in my test pair, has significantly better 
hardware than server2, the standby. If server2 is keeping up with server1 in 
replication, I predict that server2 should be able to get through the same 
amount of writes faster as a replica than as a standalone.

This expectation was borne out in my testing for INSERTs but not DROP TABLEs, 
per the following benchmarks.
I issued 9,000,000 INSERTs to server1, the primary. It took 10 minutes to 
complete these. At the end of the 10 minutes, server2, the standby, was caught 
up. When I made server2 a standalone, it took 15 minutes to get through the 
same 9,000,000 INSERTs. In conclusion, being a standalone was 50% slower for 
server2 when it came to INSERT.

I issued 100,000 DROP TABLEs to server1, the primary. It took 10 minutes to
complete these on server1, and 20 minutes for server2, the standby, to catch up.
When I made server2 a standalone, it took 18 minutes to get through the same
100,000 DROP TABLEs. In conclusion, being a standalone was 10% faster for
server2 when it came to DROP TABLE.

It seems that there is something especially inefficient about WAL files 
generated by DROP TABLE, or the process of recovering them.

I also want to mention that recovering a WAL file generated by DROP TABLE 
commands con

Tuning threshold for BAS_BULKREAD (large tables)

2019-01-21 Thread Jamison, Kirk
Hi,

I have a source-code-related question about BufferAccessStrategyType BAS_BULKREAD.
Currently, this access strategy is applied internally to sequential scans of
tables larger than 1/4 of shared_buffers.
src/backend/access/heap/heapam.c:initscan()

    if (!RelationUsesLocalBuffers(scan->rs_rd) &&
        scan->rs_nblocks > NBuffers / 4)
    ...
        /* During a rescan, keep the previous strategy object. */
        if (scan->rs_strategy == NULL)
            scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
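
As far as I can tell from GetAccessStrategy() in
src/backend/storage/buffer/freelist.c (though I may be misreading it),
BAS_BULKREAD does not try to pull the whole table into shared buffers; it gives
the scan a small ring of buffers that it keeps reusing. Roughly (simplified
sketch, not the exact code):

    /* Simplified sketch of GetAccessStrategy() for BAS_BULKREAD. */
    BufferAccessStrategy
    GetAccessStrategy(BufferAccessStrategyType btype)
    {
        int         ring_size;

        switch (btype)
        {
            case BAS_BULKREAD:
                /* a 256 kB ring, i.e. 32 buffers with the default 8 kB BLCKSZ */
                ring_size = 256 * 1024 / BLCKSZ;
                break;
            /* ... other strategy types ... */
        }

        /* allocate and return a strategy object that cycles through ring_size buffers */
        ...
    }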

Users can tune their shared_buffers size, but they are not able to tune this threshold.
I'm just wondering how it affects the current workload when the table size is 
larger than the database.
Does it really cache the large tables in shared buffers? How does it affect 
other buffers/pages stored in the shared buffers?

Oracle also has a closely related user parameter that allocates space for large 
tables in the buffer cache.
https://docs.oracle.com/database/121/VLDBG/GUID-A553169D-C6CD-443E-88C3-B746D5E32923.htm#VLDBG14145

I want to ask how PostgreSQL has optimized this, e.g. with synchronized
sequential scans.
If it is beneficial, I am wondering whether it would also be helpful in Postgres
for users to be able to tune it instead of relying on the hardcoded threshold
(NBuffers / 4).
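
To illustrate what I mean (purely hypothetical -- no such parameter exists
today), the hardcoded divisor in initscan() could become a GUC along these
lines:

    /*
     * Hypothetical sketch only: a GUC such as bulkread_strategy_divisor
     * (default 4), registered in guc.c, replacing the constant in initscan().
     */
    if (!RelationUsesLocalBuffers(scan->rs_rd) &&
        scan->rs_nblocks > NBuffers / bulkread_strategy_divisor)
    {
        ...
        scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
    }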


RE: Tuning threshold for BAS_BULKREAD (large tables)

2019-01-22 Thread Jamison, Kirk
On Tuesday, January 22, 2019 5:36 PM, Ron wrote:
>How can a subset of the database be larger than the database?
Oops. Sorry, I made a mistake on that part. What I meant was: how does Postgres
handle caching of large relations in terms of performance, especially if the
majority of the data it has to read would take up most of shared buffers, say
more than 3/4 of shared_buffers? Is the current buffer access strategy
(BAS_BULKREAD) optimal for this?


From: Ron [mailto:ronljohnso...@gmail.com]
Sent: Tuesday, January 22, 2019 5:36 PM
To: pgsql-general@lists.postgresql.org
Subject: Re: Tuning threshold for BAS_BULKREAD (large tables)

On 1/22/19 1:35 AM, Jamison, Kirk wrote:

Hi,

I have a source-code-related question about BufferAccessStrategyType BAS_BULKREAD.
Currently, this access strategy is applied internally to sequential scans of
tables larger than 1/4 of shared_buffers.
src/backend/access/heap/heapam.c:initscan()

    if (!RelationUsesLocalBuffers(scan->rs_rd) &&
        scan->rs_nblocks > NBuffers / 4)
    ...
        /* During a rescan, keep the previous strategy object. */
        if (scan->rs_strategy == NULL)
            scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);

Users can tune their shared_buffers size, but they are not able to tune this threshold.
I'm just wondering how it affects the current workload when the table size is 
larger than the database.

How can a subset of the database be larger than the database?

--
Angular momentum makes the world go 'round.