Re: Fwd: increase insert into local table from remote oracle table performance

2018-08-14 Thread Mariel Cherkassky
Hi,
I'll try to answer all your questions so that you have more information
about the situation:

I have one main table called main_table_hist. It is partitioned by range
(on a date column) and holds data that is considered "history data". I'm
trying to copy the data from the Oracle table to my local PostgreSQL table
(about 5TB). The Oracle table has one partition per day, so I will create a
partition for every day of the year (365 in total) in PostgreSQL. Each
daily partition is further split into 4 list partitions (on text values).
In total my table hierarchy should look like this:
main_table_hist
 14/08/2018_main
   14/08/2018_value1
   14/08/2018_value2
   14/08/2018_value3
   14/08/2018_value4
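
(For illustration only: assuming PostgreSQL 10+ declarative partitioning, a
hierarchy like the one above could be declared roughly as follows. Column,
table and partition names here are made up, not from the actual schema.)

```sql
-- Parent table, range-partitioned on the date column.
CREATE TABLE main_table_hist (
    day   date NOT NULL,
    value text NOT NULL,
    num   numeric
) PARTITION BY RANGE (day);

-- One partition per day, itself list-partitioned on the text column.
CREATE TABLE main_hist_2018_08_14 PARTITION OF main_table_hist
    FOR VALUES FROM ('2018-08-14') TO ('2018-08-15')
    PARTITION BY LIST (value);

-- One of the four list subpartitions for that day.
CREATE TABLE main_hist_2018_08_14_value1 PARTITION OF main_hist_2018_08_14
    FOR VALUES IN ('value1');
```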

Moreover, I have another table called "present_data" that consists of 7
partitions (the data of the last 7 days - 300GB) that I'm loading from csv
files daily. Every night I need to detach the last day's partition and
attach it to the history table.
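
(Again for illustration: with declarative partitioning, PostgreSQL 10+, the
nightly move could look like this sketch. Partition names and bounds are
invented; the day partition would also have to match the list subpartitioning
of the history table.)

```sql
-- Detach the oldest day from the "present" table.
ALTER TABLE present_data DETACH PARTITION present_2018_08_08;

-- Attach that same table as a new daily partition of the history table.
ALTER TABLE main_table_hist ATTACH PARTITION present_2018_08_08
    FOR VALUES FROM ('2018-08-08') TO ('2018-08-09');
```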

This hierarchy works well in Oracle and I'm trying to build it in
PostgreSQL. Right now I'm trying to copy the history data from the remote
database, but as I mentioned, it takes 10 hours for 200GB.

Some details:
- Setting the wals to minimum is possible, but I can't do that as a daily
workaround because it means restarting the database. I must have wals
generated in order to restore the "present_data" in case of disaster.
- My network bandwidth is 1GB.
- The columns in the table are of types character varying, bigint,
timestamp and numeric. In other words, no blobs.
- I have many check constraints on the table.
- Laurenz - "You could try a bigger value for the "prefetch" option." - Do
you have an example of how to do that?
- Might inserting directly into the right partition increase the
performance?

Thanks , Mariel.


2018-08-14 0:03 GMT+03:00 legrand legrand :

> Did you try
> - running multiple inserts in parallel,
> - stopping wal archiving,
> - tuning the fetch size?
>
> Regards
> PAscal
>
>
>
>
> --
> Sent from: http://www.postgresql-archive.org/PostgreSQL-performance-f2050081.html
>
>


Bi-modal streaming replication throughput

2018-08-14 Thread Alexis Lê-Quôc
Hi,

I have been puzzled by very different replication performance (meaning
50-100x slower) between identical replicas (both in "hardware" and
configuration) once the amount of data to replicate increases. I've gone
down a number of dead ends and am missing something (likely obvious) that I
hope folks with deeper knowledge can point out. I've tried to boil down the
data needed to describe the issue to a minimum. Thanks for taking the time
to read and for any ideas you can share.

# The setup

We run a cluster of large, SSD-backed i3.16xl nodes (64 cores visible to
Linux, ~500GB of RAM, 8GB of shared_buffers, fast NVMe drives), each
running PG 9.3 on Linux in a vanilla asynchronous streaming replication
setup: 1 primary node, 1 replica designated for failover (left alone) and 6
read replicas taking queries.

Under normal circumstances this works exactly as planned, but when I dial
up the number of INSERTs on the primary to ~10k rows per second, or roughly
50MB of data per second (not enough to saturate the network between nodes),
the read replicas fall hopelessly and consistently behind until read
traffic is diverted away. The INSERTs themselves are fairly
straightforward: a 20-byte checksum (bytea) is computed off-node and used
as a unicity constraint at insert time. Each record is 4,500 bytes wide on
average.

Here's the table where inserts happen:

                              Table "T"
 Column  |            Type             |           Modifiers           | Storage
---------+-----------------------------+-------------------------------+----------
 key     | bigint                      | not null default T.next_key() | plain
 a       | integer                     | not null                      | plain
 b       | integer                     |                               | plain
 c       | text                        |                               | extended
 d       | text                        |                               | extended
 e       | text[]                      |                               | extended
 f       | integer                     | not null                      | plain
 created | timestamp without time zone | not null default now()        | plain
 cksum   | bytea                       | not null                      | extended
Indexes:
    "T_pkey" PRIMARY KEY, btree (key)
    "T_cksum" UNIQUE, btree (cksum)
    "T_created_idx" btree (created)
    "T_full_idx" btree (a, b, c, d, e)
    "T_a_idx" btree (a)


# The symptoms

Once the primary starts to process INSERTs to the tune of 10k/s (roughly
50MB/s or 150GB/h), replication throughput becomes bi-modal within minutes.

1. We see read replicas fall behind, and we can measure their replication
throughput to be consistently 1-2% of what the primary is sustaining, by
measuring the replication delay (in seconds) every second. We quickly get
that metric to 0.98-0.99 (1 means that replication is completely stuck, as
it falls behind by one second every second). CPU, memory, I/O (per-core
iowait) and network (throughput) as whole resources are not visibly maxed
out.
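
(For reference, not from the original post: on a 9.x replica the delay can
be sampled with queries along these lines.)

```sql
-- On the replica: wall-clock time minus the last replayed transaction's
-- commit time, an approximation of replication delay (PG 9.x spelling).
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;

-- On the primary: byte lag per standby, from the WAL sender's view
-- (pre-10 function/column names: xlog rather than wal).
SELECT client_addr,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location)
           AS byte_lag
FROM pg_stat_replication;
```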

2. If we stop incoming queries on one of the replicas, we see it catch up
at 2x insert throughput (roughly 80MB/s or 300GB/h) as it cuts through the
backlog. A perf sample shows a good chunk of time spent in `mdnblocks`. I/O
wait remains at a few % (2-10) of cpu cycles. If you can open the attached
screenshot you can see the lag going down on each replica as soon as we
stop sending reads to it.


In both cases the recovery process maxes out 1 core, as expected.

# The question

What surprised me is the bi-modal nature of the throughput, without gradual
degradation or a very clear indication of the contended resource (I/O?
Buffer access?). The bi-modal throughput would be consistent with
replication being effectively scheduled to run at full speed 1% or 2% of
the time (the rest being allocated to queries), but I have not found
anything in the documentation or in the code that supports that view.

Is this the right way to think about what’s observed?
If not, what could be a good next hypothesis to test?


# References

Here are some settings that may help and a perf profile of a recovery
process that runs without any competing read traffic processing the INSERT
backlog (I don't unfortunately have the same profile on a lagging read
replica).

 name |  setting
--+---
 max_wal_senders  | 299
 max_wal_size | 10240
 min_wal_size | 5
 wal_block_size   | 8192
 wal_buffers  | 2048
 wal_compression  | off
 wal_keep_segments| 0
 wal_level| replica
 wal_log_hints| off
 wal_receiver_status_interval | 10
 wal_receiver_timeout | 6
 wal_retrieve_retry_interval  | 5000
 wal_segment_size | 2048
 wal_sender_ti

Re: Bi-modal streaming replication throughput

2018-08-14 Thread Jeff Janes
On Tue, Aug 14, 2018 at 9:18 AM, Alexis Lê-Quôc  wrote:

> each running PG 9.3 on linux


That is the oldest version which is still supported.  There have been a lot
of improvements since then, including to performance.  You should see if an
upgrade solves the problem.  If not, at least you will have access to
better tools (like pg_stat_activity.wait_event_type), and people will be
more enthusiastic about helping you figure it out knowing it is not an
already-solved problem.


>
> Here are some settings that may help and a perf profile of a recovery
> process that runs without any competing read traffic processing the INSERT
> backlog (I don't unfortunately have the same profile on a lagging read
> replica).
>

Unfortunately the perf from when the problem is not occurring won't be very
helpful.  You need it from when the problem is occurring.  Also, I find
strace and gdb more helpful than perf in this type of situation, where you
already know it is not CPU bound, although perhaps that is just my own lack
of skill with perf. You need to know why it is not on the CPU, not what it
is doing when it is on the CPU.

Were the settings you showed all of the non-default settings?

I assume max_standby_streaming_delay is at the default value of 30s?  Are
you getting query cancellations due to conflicts with recovery, or anything
else suspicious in the log?  What is the maximum lag you see, measured in
seconds?
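
(For reference, not part of the original message: recovery-conflict
cancellations can be counted on the standby with the built-in statistics
view, available since 9.1.)

```sql
-- Cumulative counts of queries cancelled on this standby due to recovery
-- conflicts, broken down by cause.
SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
       confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;
```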

Cheers,

Jeff


Re: Bi-modal streaming replication throughput

2018-08-14 Thread Andres Freund
Hi,

On 2018-08-14 15:18:55 +0200, Alexis Lê-Quôc wrote:
> We run a cluster of large, SSD-backed i3.16xl nodes (64 cores visible to
> Linux, ~500GB of RAM, 8GB of shared_buffers, fast NVMe drives), each
> running PG 9.3 on Linux in a vanilla asynchronous streaming replication
> setup: 1 primary node, 1 replica designated for failover (left alone) and
> 6 read replicas, taking queries.

9.3 is extremely old, we've made numerous performance improvements in
areas potentially related to your problem.


> Under normal circumstances this works exactly as planned, but when I dial
> up the number of INSERTs on the primary to ~10k rows per second, or
> roughly 50MB of data per second (not enough to saturate the network
> between nodes), the read replicas fall hopelessly and consistently behind
> until read traffic is diverted away.

Do you use hot_standby_feedback=on?



> 1. We see read replicas fall behind, and we can measure their replication
> throughput to be consistently 1-2% of what the primary is sustaining, by
> measuring the replication delay (in seconds) every second. We quickly get
> that metric to 0.98-0.99 (1 means that replication is completely stuck,
> as it falls behind by one second every second). CPU, memory, I/O
> (per-core iowait) or network (throughput) as a whole are not visibly
> maxed out

Are individual *cores* maxed out however? IIUC you're measuring overall
CPU util, right? Recovery (streaming replication apply) is largely
single threaded.


> Here are some settings that may help and a perf profile of a recovery
> process that runs without any competing read traffic processing the INSERT
> backlog (I don't unfortunately have the same profile on a lagging read
> replica).

Unfortunately that's not going to help us much identifying the
contention...


> +   30.25%26.78%  postgres  postgres   [.] mdnblocks

This I've likely fixed ~two years back:

http://archives.postgresql.org/message-id/72a98a639574d2e25ed94652848555900c81a799


> +   18.64%18.64%  postgres  postgres   [.] 0x000fde6a

Hm, too bad that this is without a symbol - 18% self is quite a
bit. What perf options are you using?


> +4.74% 4.74%  postgres  [kernel.kallsyms]  [k]
> copy_user_enhanced_fast_string

It's possible that slightly bigger shared_buffers would help you.

It'd probably be more helpful to look at a perf report --no-children for
this kind of analysis.

Greetings,

Andres Freund



Re: Bi-modal streaming replication throughput

2018-08-14 Thread Andres Freund
Hi,

On 2018-08-14 10:46:45 -0700, Andres Freund wrote:
> On 2018-08-14 15:18:55 +0200, Alexis Lê-Quôc wrote:
> > +   30.25%26.78%  postgres  postgres   [.] mdnblocks
> 
> This I've likely fixed ~two years back:
> 
> http://archives.postgresql.org/message-id/72a98a639574d2e25ed94652848555900c81a799

Err, wrong keyboard shortcut *and* wrong commit hash:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=45e191e3aa62d47a8bc1a33f784286b2051f45cb

- Andres



Calculating how much redo log space has been used

2018-08-14 Thread Reza Taheri
Hello PostgreSQL community,
I am helping with a benchmarking exercise using PGSQL (I chair the TPC
subcommittee that has released a benchmark using PGSQL). A requirement of
the benchmark is having enough log space allocated for 8 hours of running
without needing to archive, back up, etc. I am trying to a) figure out how
to establish the exact space usage for the auditor, and b) reduce the log
space usage. Looking at iostat and pgstatspack, it looks like we will need
to allocate something like 1.5TB of log space for a 5TB database, which is
a huge ratio. (Yes, in the real world we'd probably archive or ship the
logs, but for benchmarking that doesn't work.)

pgstatspack gives me something like below:

background writer stats

 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
                22 |               0 |            6416768 |       2252636 |                0 |          280211 |       9786558
(1 row)

background writer relative stats

 checkpoints_timed | minutes_between_checkpoint | buffers_checkpoint | buffers_clean | buffers_backend | total_writes | avg_checkpoint_write
-------------------+----------------------------+--------------------+---------------+-----------------+--------------+----------------------
 100%              | 6                          | 71%                | 25%           | 3%              | 8.659 MB/s   | 2278.000 MB

I can calculate how many checkpoint segments I have used from the MB/s. But is 
there a more direct way of seeing how/when a checkpoint segment is filled up 
and we move on to the next one?
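
(A sketch of a more direct measurement, not from the original mail: sample
the WAL insert position before and after a run and diff the two. Function
names are the pre-10 "xlog" spellings matching this era; the positions shown
are placeholders.)

```sql
-- Before the run: note the current WAL write position.
SELECT pg_current_xlog_location();

-- After the run: bytes of WAL generated between the two sampled positions
-- (substitute the values you actually observed).
SELECT pg_xlog_location_diff('AFTER_POSITION', 'BEFORE_POSITION')
    AS wal_bytes;
```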

Also, it looks like full_page_writes is the only parameter that can help
reduce the log usage, but I have to leave it on to avoid corruption after a
system crash, which is a requirement. Another requirement is a very short,
6-minute checkpoint time, which means we will likely write full pages very
often. Yes, my hands are tied!

Here are the relevant non-default settings:


shared_buffers = 18000MB   # min 128kB

temp_buffers = 2MB # min 800kB

maintenance_work_mem = 5MB # min 1MB

bgwriter_delay = 10ms  # 10-1ms between rounds

bgwriter_lru_maxpages = 200# 0-1000 max buffers written/round

effective_io_concurrency = 10 # 1-1000; 0 disables prefetching

wal_sync_method = open_datasync   # the default is the first option

wal_buffers = 16MB # min 32kB, -1 sets based on shared_buffers

wal_writer_delay = 10ms   # 1-1 milliseconds

checkpoint_segments = 750  # in logfile segments, min 1, 16MB each

checkpoint_timeout = 6min  # range 30s-1h

checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0

effective_cache_size = 512MB

default_statistics_target = 1  # range 1-1

log_destination = 'stderr' # Valid values are combinations of

logging_collector = on# Enable capturing of stderr and 
csvlog

log_directory = 'pg_log'   # directory where log files are written,

log_filename = 'postgresql-%a.log' # log file name pattern,

log_truncate_on_rotation = on # If on, an existing log file with the

log_rotation_age = 1d  # Automatic rotation of logfiles will

log_rotation_size = 0  # Automatic rotation of logfiles will

log_checkpoints = on



Re: Calculating how much redo log space has been used

2018-08-14 Thread Andres Freund
Hi,

On 2018-08-14 18:51:34 +, Reza Taheri wrote:
> Also, it looks like the full_page_writes parameter is the only thing
> that can help reduce the log usage size

There's also wal_compression.


> Another requirement is a very short, 6-minute checkpoint time, which
> means we will likely write the full page very often. Yes, my hands are
> tied!

Why is that a requirement / how is it specifically phrased? Is it a bounded
recovery time?

Greetings,

Andres Freund



Re: Fwd: increase insert into local table from remote oracle table performance

2018-08-14 Thread legrand legrand
main ideas are:

- inserting directly into the right partition:
  perform as many inserts as there are pg partitions found in
main_table_hist, like

  INSERT INTO "14/08/2018_value1"
  SELECT * FROM remote_oracle_hist
  WHERE day = to_date('14/08/2018', 'DD/MM/YYYY') AND value = 'value1';

  please check the execution plan (on the Oracle side) using EXPLAIN ANALYZE

- all those inserts should be executed in parallel (with 4 or 8 sql scripts)

- wal archiving should be disabled during hist data recovery only (not
during day-to-day operations)

- for prefetch, see https://github.com/laurenz/oracle_fdw:

  prefetch (optional, defaults to "200")

  Sets the number of rows that will be fetched with a single round-trip
  between PostgreSQL and Oracle during a foreign table scan. This is
  implemented using Oracle row prefetching. The value must be between 0 and
  10240, where a value of zero disables prefetching.

  Higher values can speed up performance, but will use more memory on the
  PostgreSQL server.
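
(A sketch answering the "do you have an example" question above: with
oracle_fdw, prefetch is a foreign table option, so it can be changed with
ALTER FOREIGN TABLE. The table name is illustrative; use ADD instead of SET
if the option has not been set on the table before.)

```sql
-- Raise the prefetch row count on an existing oracle_fdw foreign table
-- (the "prefetch" option name comes from the oracle_fdw docs quoted above).
ALTER FOREIGN TABLE remote_oracle_hist
    OPTIONS (SET prefetch '10240');
```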


Regards
PAscal


