Dealing with latency to replication slave; what to do?
Hi,

Looking for any tips on how best to maintain a replication slave which is operating over a link with some latency between networks - around 230ms. On a good day/week, replication will keep up for a number of days, but when the link is under higher than average usage, replication can stay active for merely minutes before falling behind again.

2018-07-24 18:46:14 GMT LOG:  database system is ready to accept read only connections
2018-07-24 18:46:15 GMT LOG:  started streaming WAL from primary at 2B/9300 on timeline 1
2018-07-24 18:59:28 GMT LOG:  incomplete startup packet
2018-07-24 19:15:36 GMT LOG:  incomplete startup packet
2018-07-24 19:15:36 GMT LOG:  incomplete startup packet
2018-07-24 19:15:37 GMT LOG:  incomplete startup packet

As you can see above, it lasted about half an hour before falling out of sync. On the master I have wal_keep_segments=128. What is happening when I see "incomplete startup packet" - is it simply that the slave has fallen behind and cannot 'catch up' using the WAL segments quickly enough? I assume the slave is using the WAL segments to replay changes, and that as long as there are enough WAL segments to cover the period during which it cannot stream properly, it will eventually recover?
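For context, the lag can also be measured directly on the primary rather than inferred from the standby's log. A query along these lines (a sketch assuming PostgreSQL 10 or later; on 9.x the function is pg_xlog_location_diff, the current-position function is pg_current_xlog_location, and the column is replay_location) shows how many bytes of WAL each connected standby still has to replay:

    -- Run on the primary: how far behind each connected standby is.
    SELECT application_name,
           client_addr,
           state,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
    FROM pg_stat_replication;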
Re: Dealing with latency to replication slave; what to do?
Hi Andres, regarding your first reply, I was inferring that from the fact I saw those messages at the same time the replication stream fell behind. What other logs would be more pertinent to this situation?

On Tue, Jul 24, 2018 at 4:02 PM Andres Freund wrote:
> Hi,
>
> On 2018-07-24 15:39:32 -0400, Rory Falloon wrote:
> > Looking for any tips here on how to best maintain a replication slave which
> > is operating under some latency between networks - around 230ms. On a good
> > day/week, replication will keep up for a number of days, but however, when
> > the link is under higher than average usage, keeping replication active can
> > last merely minutes before falling behind again.
> >
> > 2018-07-24 18:46:14 GMT LOG:  database system is ready to accept read only connections
> > 2018-07-24 18:46:15 GMT LOG:  started streaming WAL from primary at 2B/9300 on timeline 1
> > 2018-07-24 18:59:28 GMT LOG:  incomplete startup packet
> > 2018-07-24 19:15:36 GMT LOG:  incomplete startup packet
> > 2018-07-24 19:15:36 GMT LOG:  incomplete startup packet
> > 2018-07-24 19:15:37 GMT LOG:  incomplete startup packet
> >
> > As you can see above, it lasted about half an hour before falling out of sync.
>
> How can we see that from the above? The "incomplete startup packet" messages
> are independent of streaming rep. I think you need to show us more logs.
>
> > On the master, I have wal_keep_segments=128. What is happening when I see
> > "incomplete startup packet" - is it simply the slave has fallen behind,
> > and cannot 'catch up' using the wal segments quick enough? I assume the
> > slave is using the wal segments to replay changes and assuming there are
> > enough wal segments to cover the period it cannot stream properly, it will
> > eventually recover?
>
> You might want to look into replication slots to ensure the primary
> keeps the necessary segments, but not more, around. You might also want
> to look at wal_compression, to reduce the bandwidth usage.
>
> Greetings,
>
> Andres Freund
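For anyone following along, a minimal sketch of those two suggestions, run on the primary (the slot name 'standby_site_b' is only a placeholder here; wal_compression needs 9.5 or later):

    -- Create a physical replication slot so the primary retains WAL for exactly
    -- as long as this standby needs it, and no longer.
    SELECT pg_create_physical_replication_slot('standby_site_b');

    -- Compress full-page images in WAL to reduce bandwidth on the slow link.
    ALTER SYSTEM SET wal_compression = on;
    SELECT pg_reload_conf();

The standby then has to be pointed at the slot (primary_slot_name = 'standby_site_b' in recovery.conf on pre-v12 standbys) and restarted. One caveat: a slot retains WAL without limit while its standby stays disconnected, so disk usage on the primary needs watching.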
WAL streaming and dropping a large table
Hi,

My postgres server setup is thus: a production database which is using WAL streaming (hot standby) to four other hosts of various latencies. I have a table that is ~200GB which has been backed up, and now I want to remove it. The table is not in use; it is a child table of a parent table that _is_ in use, but I foresee no issues here. The total DB size is roughly 300GB.

The main reason for needing to remove it is to recover the space on the partition, which is humming away at 88% usage, and the table I wish to drop is better off in an archive somewhere.

I just removed around 10% of it with 'delete from', which of course cascaded to the replication hosts. This increased the size of my pg_xlog folder (from ~3GB to ~6.5GB), which of course increased my partition usage. Obviously this is expected, but I wonder what would happen if I had issued the 'drop table'? I expect that having max_replication_slots enabled means the database would retain the WAL segments until all standbys have caught up; could it take quite a long time to 'catch up' after the 'drop table' command? It took about 10 minutes before the pg_xlog folder size had 'settled down' to what I normally see as the default, which is ~3GB.

Current settings:
    wal_keep_segments is set to 128
    max_wal_size is not set
    max_replication_slots is enabled

I'd prefer to use 'drop table' as it would recover the space immediately, but not if it will impact the production database in a way that the partition becomes full, which defeats the purpose. Is it advisable to move the pg_xlog folder to another volume on the system with more space (which I have) and symlink it - obviously a stop & start of the db required - and then let the WAL archives fill up as need be? Or am I missing something obvious (likely)?

Thanks
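Before issuing the drop, one way to see how much WAL the slots and the slow standbys are actually holding back is something like the following, run on the primary (a sketch assuming PostgreSQL 10 or later; pg_ls_waldir() does not exist on 9.x, where the directory is still pg_xlog rather than pg_wal):

    -- WAL each replication slot is still forcing the primary to keep around.
    SELECT slot_name,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;

    -- Current on-disk size of the WAL directory.
    SELECT pg_size_pretty(sum(size)) AS wal_dir_size FROM pg_ls_waldir();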