Resync second slave to new master

2018-03-05 Thread Dylan Luong
Hi

We are currently on PostgreSQL 9.6 and our HA is configured as master/slave
with WAL streaming replication.
We would like to setup a second slave which replicates from the same master.

Eg:
Current setup:

Slave <-- Master
Slave is set up for automatic failover, and we use pg_rewind after a failover
to rewind the old master as the new slave.

New setup:

Slave1   <-- Master --> Slave2

Slave 2 (new) is for reporting and nightly pg_dumps, as we have some very big
databases.
Slave 1 is for HA and is set up for automatic failover.
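
For reference, Slave2 would be created in the usual way, something like this
(a sketch; the host name and replication user are placeholders for our setup):

  pg_basebackup -h master -p 5432 -U replicant \
    -D /var/lib/pgsql/9.6/data -X stream -R

The -R option writes a recovery.conf on Slave2 with standby_mode = 'on' and
primary_conninfo pointing at the master.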

After a failover (promote) to Slave1, is it easy to resync Slave2 to the new
master (old Slave1)? Do we need to do a full rebuild of Slave2 from the new
master every time we fail over from the Master to Slave1? Can we use pg_rewind
on Slave2 to resync it with the new master (old Slave1)?

Thanks
Dylan


RE: Resync second slave to new master

2018-03-05 Thread Dylan Luong
Thanks Michael,
So every time after promoting Slave1 to become master (either manually or
automatically), we just stop Slave2 and run pg_rewind on Slave2 against the new
master (old Slave1). And when the old master server is available again, we use
pg_rewind on that server as well against the new master to return to the
original configuration.

-Original Message-
From: Michael Paquier [mailto:mich...@paquier.xyz] 
Sent: Tuesday, 6 March 2018 3:54 PM
To: Dylan Luong 
Cc: pgsql-generallists.postgresql.org 
Subject: Re: Resync second slave to new master

On Tue, Mar 06, 2018 at 04:45:10AM +, Dylan Luong wrote:
> After a failover (promote) to Slave1, is it easy to resync Slave2 to
> the new master (old Slave1)? Do we need to do a full rebuild of Slave2
> from the new master every time we fail over from the Master to Slave1?
> Can we use pg_rewind on Slave2 to resync it with the new master (old
> Slave1)?

After promoting slave 1, it is possible that some records have slipped to
slave 2 from the primary.  In this case, a rewind would be recommended.  You
should be careful that your slave 2 has not received WAL to a position newer
than where WAL has forked because of the promotion.  If that happened, then a
rewind would be necessary before replugging slave 2 into the newly-promoted
server.  Be very careful with your failover flow here.  pg_rewind would also
not do anything if it finds that the target server does not need a rewind, so
you could stop slave 2 and run pg_rewind unconditionally to keep things simple.
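
For instance, a minimal sketch of that flow (the host name and data directory
are placeholders):

  # on slave 2
  pg_ctl -D /var/lib/pgsql/9.6/data stop -m fast
  pg_rewind -D /var/lib/pgsql/9.6/data \
    --source-server="host=new-master port=5432 user=postgres"
  # point primary_conninfo in recovery.conf at the new master, then
  pg_ctl -D /var/lib/pgsql/9.6/data start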
--
Michael



RE: Resync second slave to new master

2018-03-07 Thread Dylan Luong
Hi Michael,

I tested the failover today and slave 2 failed to resync with the new master
(old slave1).

After I promoted slave1 to become master, I was able to use pg_rewind on the
old master and bring it back as the new slave.

I then stopped slave2 and ran pg_rewind on slave2 against the new master; it
reported that no rewind was required:

  $ pg_rewind -D /var/lib/pgsql/9.6/data 
--source-server="host=x.xxx. port=5432 user=postgres"
  servers diverged at WAL position 1BB/AB98 on timeline 5
  no rewind required

So I then updated the recovery.conf on slave2 with primary_conninfo equal to
the new master IP.
When starting up Postgres, it failed with the following errors in the logs:

database system was shut down in recovery at 2018-03-08 17:52:10 ACDT
2018-03-08 17:56:27 ACDT [23026]: [2-1] db=,user= app=,host= LOG:  entering 
standby mode
cp: cannot stat '/pg_backup/backup/archive/0005.history': No such file or 
directory
cp: cannot stat '/pg_backup/backup/archive/000501BB00AB': No such 
file or directory
2018-03-08 17:56:27 ACDT [23026]: [3-1] db=,user= app=,host= LOG:  consistent 
recovery state reached at 1BB/AB98
2018-03-08 17:56:27 ACDT [23026]: [4-1] db=,user= app=,host= LOG:  record with 
incorrect prev-link 1B9/7340 at 1BB/AB98
2018-03-08 17:56:27 ACDT [23024]: [3-1] db=,user= app=,host= LOG:  database 
system is ready to accept read only connections
2018-03-08 17:56:27 ACDT [23032]: [1-1] db=,user= app=,host= LOG:  started 
streaming WAL from primary at 1BB/AB00 on timeline 5
2018-03-08 17:56:27 ACDT [23032]: [2-1] db=,user= app=,host= LOG:  replication 
terminated by primary server
2018-03-08 17:56:27 ACDT [23032]: [3-1] db=,user= app=,host= DETAIL:  End of 
WAL reached on timeline 5 at 1BB/AB98.
cp: cannot stat '/pg_backup/backup/archive_sync/000501BB00AB': No 
such file or directory
2018-03-08 17:56:27 ACDT [23032]: [4-1] db=,user= app=,host= LOG:  restarted 
WAL streaming at 1BB/AB00 on timeline 5
2018-03-08 17:56:27 ACDT [23032]: [5-1] db=,user= app=,host= LOG:  replication 
terminated by primary server
2018-03-08 17:56:27 ACDT [23032]: [6-1] db=,user= app=,host= DETAIL:  End of 
WAL reached on timeline 5 at 1BB/AB98.


On the new master in the /pg_backup/backup/archive folder I can see a file 
000501BB00AB.partial
Eg.
ls -l
-rw---. 1 postgres postgres 16777216 Mar  8 16:48 
000501BB00AB.partial
-rw---. 1 postgres postgres 16777216 Mar  8 16:49 000601BB00AB
-rw---. 1 postgres postgres 16777216 Mar  8 16:49 000601BB00AC
-rw---. 1 postgres postgres 16777216 Mar  8 16:49 000601BB00AD
-rw---. 1 postgres postgres 16777216 Mar  8 16:49 000601BB00AE
-rw---. 1 postgres postgres 16777216 Mar  8 16:49 000601BB00AF
-rw---. 1 postgres postgres 16777216 Mar  8 16:49 000601BB00B0
-rw---. 1 postgres postgres 16777216 Mar  8 16:49 000601BB00B1
-rw---. 1 postgres postgres 16777216 Mar  8 16:49 000601BB00B2
-rw---. 1 postgres postgres 16777216 Mar  8 16:50 000601BB00B3
-rw---. 1 postgres postgres 16777216 Mar  8 17:01 000601BB00B4
-rw---. 1 postgres postgres 16777216 Mar  8 17:14 000601BB00B5
-rw---. 1 postgres postgres  218 Mar  8 16:48 0006.history

Any ideas?

Dylan

-Original Message-
From: Michael Paquier [mailto:mich...@paquier.xyz] 
Sent: Tuesday, 6 March 2018 5:55 PM
To: Dylan Luong 
Cc: pgsql-generallists.postgresql.org 
Subject: Re: Resync second slave to new master

On Tue, Mar 06, 2018 at 06:00:40AM +, Dylan Luong wrote:
> So every time after promoting Slave1 to become master (either manually
> or automatically), we just stop Slave2 and run pg_rewind on Slave2
> against the new master (old Slave1). And when the old master server is
> available again, we use pg_rewind on that server as well against the
> new master to return to the original configuration.

Yes.  That's exactly the idea.  Running pg_rewind on the old master will
be necessary anyway because you need to stop it cleanly once, which will
cause it to generate WAL records at least for the shutdown checkpoint,
while doing it on slave 2 may be optional, though still safer to do.
--
Michael



RE: Resync second slave to new master

2018-03-08 Thread Dylan Luong
Thanks! That fixed the issue.

-Original Message-
From: Michael Paquier [mailto:mich...@paquier.xyz] 
Sent: Thursday, 8 March 2018 6:41 PM
To: Yavuz Selim Sertoğlu 
Cc: Dylan Luong ; pgsql-generallists.postgresql.org 

Subject: Re: Resync second slave to new master

On Thu, Mar 08, 2018 at 10:48:29AM +0300, Yavuz Selim Sertoğlu wrote:
> If it is not set, could you add the
> recovery_target_timeline='latest'
> parameter to the recovery.conf file?
> https://www.postgresql.org/docs/devel/static/recovery-target-settings.html

Yes, that's visibly the issue here.
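
For reference, a minimal recovery.conf sketch for slave 2 with that parameter
set (the connection values and archive path are placeholders following the
paths in this thread):

  standby_mode = 'on'
  primary_conninfo = 'host=new-master port=5432 user=replicant'
  restore_command = 'cp /pg_backup/backup/archive/%f %p'
  recovery_target_timeline = 'latest'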
--
Michael


Point-in-time recovery after failover

2018-03-13 Thread Dylan Luong
Hi

We are on Postgres 9.6 and we have primary/standby WAL replication set up for HA.

For PITR, we have scheduled nightly base backup and hourly archive backups on 
the primary server.
https://www.postgresql.org/docs/9.6/static/continuous-archiving.html

I am trying to perform a point-in-time recovery after a failover has occurred.

For example:

1 am archive backup on P1
1:30 am base backup on P1 (primary) (timeline 5)
2 am archive backup on P1
3 am archive backup on P1
Etc..
1:49 pm failover P1 crashed, S1 got promoted. (timeline 6)
2 pm archive backup on S1
3 pm archive backup on S1

For example:
At 3:15 pm something got corrupted and we need to restore from backup to 1:54 pm.

I extracted the base backup (tar files) into the data directory and extracted
all the archive backups (from both P1 and S1) into the archive directory,
roughly as sketched below. I followed the steps from the continuous-archiving
documentation linked above and created a recovery.conf, but the recovery only
recovers to the point before the failover and then comes online.
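
The restore steps were roughly (a sketch; the backup file names are
placeholders for our layout):

  # extract the base backup into an empty data directory
  tar -xf /pg_backup/backup/base/basebackup.tar -C /var/lib/pgsql/9.6/data
  # extract the archived WAL from both P1 and S1 into the archive directory
  tar -xf /pg_backup/backup/archive_p1.tar -C /pg_backup/backup/archive
  tar -xf /pg_backup/backup/archive_s1.tar -C /pg_backup/backup/archive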

Example of  my recovery.conf :
restore_command = 'cp /pg_backup/backup/archive/%f %p'
recovery_target_time = '2018-03-13 13:54:00'
recovery_target_inclusive = 'false'
recovery_target_timeline = '6'  (I also tried recovery_target_timeline = 'latest')

Is it possible to do a PITR after a failover, using a base backup from before
the failover and the archive backups from both P1 and S1?


Here is the log output :

2018-03-13 20:46:53 ACDT [154912]: [1-1] db=,user= app=,host= LOG:  database 
system was interrupted; last known up at 2018-03-13 01:31:25 ACDT
2018-03-13 20:46:53 ACDT [154912]: [2-1] db=,user= app=,host= LOG:  restored 
log file "0006.history" from archive
cp: cannot stat '/pg_backup/backup/archive/0007.history': No such file or 
directory
2018-03-13 20:46:53 ACDT [154912]: [3-1] db=,user= app=,host= LOG:  starting 
point-in-time recovery to 2018-03-13 13:54:00+10:30
2018-03-13 20:46:53 ACDT [154912]: [4-1] db=,user= app=,host= LOG:  restored 
log file "0006.history" from archive
cp: cannot stat '/pg_backup/backup/archive/000601110087': No such 
file or directory
2018-03-13 20:46:53 ACDT [154912]: [5-1] db=,user= app=,host= LOG:  restored 
log file "000501110087" from archive
cp: cannot stat '/pg_backup/backup/archive/0005.history': No such file or 
directory
2018-03-13 20:46:53 ACDT [154912]: [6-1] db=,user= app=,host= LOG:  redo starts 
at 111/8728
2018-03-13 20:46:53 ACDT [154912]: [7-1] db=,user= app=,host= LOG:  consistent 
recovery state reached at 111/870B4418
2018-03-13 20:46:53 ACDT [154908]: [3-1] db=,user= app=,host= LOG:  database 
system is ready to accept read only connections
cp: cannot stat '/pg_backup/backup/archive/000601110088': No such 
file or directory
2018-03-13 20:46:53 ACDT [154912]: [8-1] db=,user= app=,host= LOG:  restored 
log file "000501110088" from archive
cp: cannot stat '/pg_backup/backup/archive/000601110089': No such 
file or directory
2018-03-13 20:46:53 ACDT [154912]: [9-1] db=,user= app=,host= LOG:  restored 
log file "000501110089" from archive
cp: cannot stat '/pg_backup/backup/archive/00060111008A': No such 
file or directory
2018-03-13 20:46:53 ACDT [154912]: [10-1] db=,user= app=,host= LOG:  restored 
log file "00050111008A" from archive
cp: cannot stat '/pg_backup/backup/archive/00060111008B': No such 
file or directory
2018-03-13 20:46:53 ACDT [154912]: [11-1] db=,user= app=,host= LOG:  restored 
log file "00050111008B" from archive
cp: cannot stat '/pg_backup/backup/archive/00060111008C': No such 
file or directory
2018-03-13 20:46:53 ACDT [154912]: [12-1] db=,user= app=,host= LOG:  restored 
log file "00050111008C" from archive
cp: cannot stat '/pg_backup/backup/archive/00060111008D': No such 
file or directory
2018-03-13 20:46:53 ACDT [154912]: [13-1] db=,user= app=,host= LOG:  restored 
log file "00050111008D" from archive
cp: cannot stat '/pg_backup/backup/archive/00060111008E': No such 
file or directory
2018-03-13 20:46:53 ACDT [154912]: [14-1] db=,user= app=,host= LOG:  restored 
log file "00050111008E" from archive
cp: cannot stat '/pg_backup/backup/archive/00060111008F': No such 
file or directory
2018-03-13 20:46:54 ACDT [154912]: [15-1] db=,user= app=,host= LOG:  restored 
log file "00050111008F" from archive
cp: cannot stat '/pg_backup/backup/archive/000601110090': No such 
file or directory
2018-03-13 20:46:54 ACDT [154912]: [16-1] db=,user= app=,host= LOG:  restored 
log file "000501110090" from archive
cp: cannot stat '/pg_backup/backup/archive/000601110091': No such 
file or directory
2018-03-13 20:46:54 ACDT [154912]: [17-1] db=,user= app=,host= LOG:  restored 
log file "000501110091" from archive
cp: cannot stat '/pg_backup/backup/archive/000601110092': No such 
file or directory

help with startup slave after pg_rewind

2018-09-19 Thread Dylan Luong

Hi
After promoting the slave to master, I ran pg_rewind on the old master (now the
new slave) against the new master. But when I try to start the new slave I am
getting the following errors.

2018-09-20 07:53:51 ACST [20265]: [2-1] db=[unknown],user=replicant 
app=[unknown],host=10.69.20.22(51271) FATAL:  the database system is starting up
2018-09-20 07:53:51 ACST [20264]: [3-1] db=,user= app=,host= LOG:  restored log 
file "000C.history" from archive
2018-09-20 07:53:51 ACST [20264]: [4-1] db=,user= app=,host= LOG:  restored log 
file "000C085B" from archive
2018-09-20 07:53:51 ACST [20264]: [5-1] db=,user= app=,host= LOG:  contrecord 
is requested by 85B/28
2018-09-20 07:53:51 ACST [20268]: [1-1] db=,user= app=,host= LOG:  started 
streaming WAL from primary at 85B/0 on timeline 12
2018-09-20 07:53:51 ACST [20264]: [6-1] db=,user= app=,host= LOG:  contrecord 
is requested by 85B/28
2018-09-20 07:53:51 ACST [20268]: [2-1] db=,user= app=,host= FATAL:  
terminating walreceiver process due to administrator command
2018-09-20 07:53:51 ACST [20264]: [7-1] db=,user= app=,host= LOG:  restored log 
file "000C085B" from archive
2018-09-20 07:53:51 ACST [20264]: [8-1] db=,user= app=,host= LOG:  contrecord 
is requested by 85B/28
2018-09-20 07:53:51 ACST [20264]: [9-1] db=,user= app=,host= LOG:  contrecord 
is requested by 85B/28
2018-09-20 07:53:51 ACST [20264]: [10-1] db=,user= app=,host= LOG:  restored 
log file "000C085B" from archive
2018-09-20 07:53:51 ACST [20264]: [11-1] db=,user= app=,host= LOG:  contrecord 
is requested by 85B/28
2018-09-20 07:53:51 ACST [20264]: [12-1] db=,user= app=,host= LOG:  contrecord 
is requested by 85B/28


I tried to run pg_rewind again, but now it says I cannot do it as it is already
on the same timeline.

Regards
Dylan



clean out ./data/base/pgsql_tmp

2017-12-06 Thread Dylan Luong
Hi

We have an issue where one of the developers ran a large query that hung and
was filling up the DATA drive very rapidly, from 50% usage to 95% in less than
two hours.
It created a very large pgsql_tmp directory (300 GB). To stop the drive from
filling up, we had to kill the process manually using kill -9.
How do we clean up the pgsql_tmp folder? Will Postgres periodically clean it,
e.g. at a CHECKPOINT?
Thanks.

Regards
Dylan


RE: clean out ./data/base/pgsql_tmp

2017-12-06 Thread Dylan Luong
Since the temp files are easily identifiable, as they have the PID in the
filename, is it OK to just manually delete these files, given that the process
has already been killed?
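
For example, something like this (a sketch; the data directory path and the
PID shown are hypothetical):

  # temp files are named pgsql_tmp<PID>.<N>, so for the killed backend with PID 12345:
  ls -lh /var/lib/pgsql/9.6/data/base/pgsql_tmp/pgsql_tmp12345.*
  rm /var/lib/pgsql/9.6/data/base/pgsql_tmp/pgsql_tmp12345.*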

-Original Message-
From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com] 
Sent: Thursday, 7 December 2017 2:56 PM
To: Dylan Luong 
Cc: pgsql-general@lists.postgresql.org
Subject: Re: clean out ./data/base/pgsql_tmp

On Thu, Dec 7, 2017 at 5:18 PM, Dylan Luong  wrote:
> We have an issue where one of the developers ran a large query that
> hung and was filling up the DATA drive very rapidly, from 50% usage to
> 95% in less than two hours.
>
> It created a very large pgsql_tmp directory (300 GB). To stop the drive
> from filling up, we had to kill the process manually using kill -9.
>
> How do we clean up the pgsql_tmp folder? Will Postgres periodically
> clean it, e.g. at a CHECKPOINT?

Hi Dylan,

A clean shutdown/restart will clean it out.  A crash restart (as happens if you 
kill processes like that) doesn't clear away pgsql_tmp data on the theory that 
it might be useful for forensics.

--
Thomas Munro
http://www.enterprisedb.com


Missing WAL file after running pg_rewind

2018-01-11 Thread Dylan Luong
Hi

We had a failover situation where our monitoring watchdog processes promoted
the slave to become the new master.
I restarted the old master database to ensure a clean stop/start and performed
pg_rewind on the old master to resync with the new master. However, after a
successful rewind, there was an error restarting the new slave.
The steps I took were (a rough command sketch follows the list):

1.   Stop all watchdogs

2.   Start/stop the old master

3.   Run 'checkpoint' on new master

4.   Run the pg_rewind on old master to resync with new master

5.   Start the old master (as new slave)
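
Roughly, as commands (a sketch; the host name and data directory are
placeholders):

  # 1. stop all watchdogs (site-specific)
  # 2. start, then cleanly stop, the old master
  pg_ctl -D /var/lib/pgsql/9.6/data start
  pg_ctl -D /var/lib/pgsql/9.6/data stop -m fast
  # 3. force a checkpoint on the new master
  psql -h new-master -c "CHECKPOINT;"
  # 4. rewind the old master against the new master
  pg_rewind -D /var/lib/pgsql/9.6/data \
    --source-server="host=new-master port=5432 user=postgres"
  # 5. start the old master as the new slave (recovery.conf pointing at the new master)
  pg_ctl -D /var/lib/pgsql/9.6/data start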

Step 4 (pg_rewind) was successful, with the new slave rewound to the new
timeline of the new master; however, during the restart of the new slave it
failed to start with the following errors:

80) FATAL:  the database system is starting up
cp: cannot stat '/pg_backup/backup/archive_sync/0004038300BF': No 
such file or directory
cp: cannot stat '/pg_backup/backup/archive_sync/0003038300BF': No 
such file or directory
cp: cannot stat '/pg_backup/backup/archive_sync/0002038300BF': No 
such file or directory
cp: cannot stat '/pg_backup/backup/archive_sync/0001038300BF': No 
such file or directory
2018-01-11 23:21:59 ACDT [112235]: [1-1] db=,user= app=,host= LOG:  started 
streaming WAL from primary at
383/BE00 on timeline 6
2018-01-11 23:21:59 ACDT [112235]: [2-1] db=,user= app=,host= FATAL:  could not 
receive data from WAL stre
am: ERROR:  requested WAL segment 0006038300BE has already been 
removed

I checked both the archive and pg_xlog directories on the new master and
cannot locate the missing file.

Has anyone experienced this before with pg_rewind?

The earliest WAL files in the archive directory were from just after the
failover occurred.

Eg, in the archive directory on the new Master:
$ ls -l
total 15745032
-rw---. 1 postgres postgres 16777216 Jan 11 17:52 
0005038300C0.partial
-rw---. 1 postgres postgres 16777216 Jan 11 17:52 0006038300C0
-rw---. 1 postgres postgres 16777216 Jan 11 17:52 0006038300C1
-rw---. 1 postgres postgres 16777216 Jan 11 17:52 0006038300C2
-rw---. 1 postgres postgres 16777216 Jan 11 17:52 0006038300C

And on the pg_xlog directory on the new Master:
-rw---. 1 postgres postgres 16777216 Jan 11 18:57 000603850080
-rw---. 1 postgres postgres 16777216 Jan 11 18:57 000603850081
-rw---. 1 postgres postgres 16777216 Jan 11 18:57 000603850082
-rw---. 1 postgres postgres 16777216 Jan 11 18:57 000603850083
-rw---. 1 postgres postgres 16777216 Jan 11 18:57 000603850084
-rw---. 1 postgres postgres 16777216 Jan 11 18:57 000603850085
-rw---. 1 postgres postgres 16777216 Jan 11 18:57 000603850086
-rw---. 1 postgres postgres 16777216 Jan 11 18:57 000603850087

Thanks
Dylan



RE: Missing WAL file after running pg_rewind

2018-01-12 Thread Dylan Luong
The file exists in the archive directory of the old master, but it is for the
previous timeline, i.e. 5 and not 6: 0005038300BE.
Can I just rename the file to timeline 6, i.e. 0006038300BE?

-Original Message-
From: Michael Paquier [mailto:michael.paqu...@gmail.com] 
Sent: Friday, 12 January 2018 12:08 PM
To: Dylan Luong 
Cc: pgsql-general@lists.postgresql.org
Subject: Re: Missing WAL file after running pg_rewind

On Thu, Jan 11, 2018 at 04:58:02PM +, Dylan Luong wrote:
> The steps I took were:
> 
> 1.   Stop all watchdogs
> 
> 2.   Start/stop the old master
> 
> 3.   Run 'checkpoint' on new master
> 
> 4.   Run the pg_rewind on old master to resync with new master
> 
> 5.   Start the old master (as new slave)

That's a sane flow to me.

> 2018-01-11 23:21:59 ACDT [112235]: [2-1] db=,user= app=,host= FATAL:  
> could not receive data from WAL stre
> am: ERROR:  requested WAL segment 0006038300BE has already 
> been removed
> 
> Has anyone experience this before with pg_rewind?

When restarting a standby after a rewind has been done to it, note that, in 
order to recover to a consistent point, it needs to replay WAL from the last 
checkpoint before the point where WAL forked during the promotion, up to the 
point where the rewind has finished. Per your logs, I gather that the previous 
checkpoint before the timeline jump is located in segment 000X038300BE, 
but this did not get archived.

> The earliest WAL files in the archive directory were from just after the
> failover occurred.
> 
> Eg, in the archive directory on the new Master:
> $ ls -l
> total 15745032
> -rw---. 1 postgres postgres 16777216 Jan 11 17:52 0005038300C0.partial
> -rw---. 1 postgres postgres 16777216 Jan 11 17:52 0006038300C0
> -rw---. 1 postgres postgres 16777216 Jan 11 17:52 0006038300C1
> -rw---. 1 postgres postgres 16777216 Jan 11 17:52 0006038300C2

Yeah, you are looking for the WAL segment just before the last, partial WAL 
segment of the previous timeline. Depending on your archiving strategy, I guess 
that you should have set archive_mode = 'always' so that the server which was 
the standby before the promotion is also able to store them.
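
For example, in postgresql.conf on the standby (a sketch; the archive path here
just mirrors the archive directory used elsewhere in these threads):

  archive_mode = always
  archive_command = 'cp %p /pg_backup/backup/archive/%f'

Note that changing archive_mode requires a server restart.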
--
Michael



RE: Missing WAL file after running pg_rewind

2018-01-14 Thread Dylan Luong
The content of the history file 0006.history is:

$ more 0006.history
1    2CE/8A98        no recovery target specified
2    2CE/FF974EF0    no recovery target specified
3    2CF/5198        no recovery target specified
4    2D1/C90ACD88    no recovery target specified
5    383/C0790E50    no recovery target specified

Here are the last few archive files of the old timeline (old master):

0005038300BD
0005038300BE
0005038300BF
0005038300C0
0005038300C1

Here are the first few archive files of the new timeline (new master):
0005038300C0.partial
0006038300C0
0006038300C1
0006038300C2
0006038300C3

Looks like it has forked at C0.
But why is the new slave asking for 0006038300BE on the new timeline
during the restore after the pg_rewind, and not for C0?

Dylan


-Original Message-
From: Michael Paquier [mailto:michael.paqu...@gmail.com] 
Sent: Saturday, 13 January 2018 9:04 PM
To: Dylan Luong 
Cc: pgsql-general@lists.postgresql.org
Subject: Re: Missing WAL file after running pg_rewind

On Fri, Jan 12, 2018 at 09:44:25PM +, Dylan Luong wrote:
> The file exists in the archive directory of the old master, but it is
> for the previous timeline, i.e. 5 and not 6:
> 0005038300BE. Can I just rename the file to timeline 6, i.e.
> 0006038300BE?

What are the contents of the history file for this new timeline? You are 
looking for 0006.history, which should be archived as well. You could do the 
rename, assuming that WAL forked on this segment at promotion, as both segments 
would have the same contents up to the point where WAL forked.
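
In that case the rename itself would just be something like (a sketch, run in
the new master's archive directory, keeping a copy of the original file):

  cp 0005038300BE 0006038300BE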
--
Michael



Performance issues during backup

2018-02-19 Thread Dylan Luong
Hi

We perform a nightly base backup of our production PostgreSQL instance. We have
a script that basically puts the instance into backup mode, backs up (tar) the
/Data directory, and then takes it out of backup mode.
I.e.,
psql -c "SELECT pg_start_backup('${DATE}');"
tar -cvf - ${DATA_DIR} --exclude ${DATA_DIR}/pg_log | split -d -b 
$TAR_SPLIT_SIZE - ${BACKUP_DIR}/${BACKUP_NAME}
psql -c "SELECT pg_stop_backup();"

The size of our database is about 250 GB and it usually takes about 1 hour to
back up.
During this time, we have a performance issue where queries can take up to 15
seconds to return when they normally take 2 to 3 seconds.
During this time (1:30am) usage is low (less than 10 users) on the system.

Has anyone experienced the same problem, and are there any suggestions on where
to look to resolve it?

Thanks.
Dylan




RE: Performance issues during backup

2018-02-22 Thread Dylan Luong
Hi
Thanks for the reply.
I have trialed the ionice -c 2 -n 7 tar ... change to our backup script and it
appears to have helped, but not by much.
The affected queries are mostly update/delete/insert queries. Could
pg_start_backup be causing locking of some sort?
Regards
Dylan

From: Rene Romero Benavides [mailto:rene.romer...@gmail.com]
Sent: Wednesday, 21 February 2018 1:37 AM
To: Laurenz Albe 
Cc: Dylan Luong ; pgsql-general@lists.postgresql.org
Subject: Re: Performance issues during backup

What about sending the backup to a different server, through ssh/rsync or
something? That would save lots of IO activity.

2018-02-20 2:02 GMT-06:00 Laurenz Albe <laurenz.a...@cybertec.at>:
Dylan Luong wrote:
> We perform a nightly base backup of our production PostgreSQL instance. We have
> a script that basically puts the instance
> into backup mode, backs up (tar) the /Data directory, and then takes it
> out of backup mode.
> Ie,
> psql -c "SELECT pg_start_backup('${DATE}');"
> tar -cvf - ${DATA_DIR} --exclude ${DATA_DIR}/pg_log | split -d -b 
> $TAR_SPLIT_SIZE - ${BACKUP_DIR}/${BACKUP_NAME}
> psql -c "SELECT pg_stop_backup();"
>
> The size of our database is about 250 GB and it usually takes about 1 hour to
> back up.
> During this time, we have a performance issue where queries can take up to
> 15 seconds to return when they normally take 2 to 3 seconds.
> During this time (1:30am) usage is low (less than 10 users) on the system.
>
> Has anyone experienced the same problem, and are there any suggestions on
> where to look to resolve it?

The "tar" is probably taking up too much I/O bandwidth.

Assuming this is Linux, you could run it with

  ionice -c 2 -n 7 tar ...

or

  ionice -c 3 tar ...

Of course then you can expect the backup to take more time.
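
Applied to the backup script above, the tar line would become something like
(a sketch; the variables are as in the original script):

  ionice -c 2 -n 7 tar -cvf - ${DATA_DIR} --exclude ${DATA_DIR}/pg_log | \
    split -d -b $TAR_SPLIT_SIZE - ${BACKUP_DIR}/${BACKUP_NAME}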

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com



--
Genius is 1% inspiration and 99% perspiration.
Thomas Alva Edison
http://pglearn.blogspot.mx/