pg12 partitions show bad performance vs pg96

2020-03-08 Thread Mariel Cherkassky
Hey,
I upgraded from 9.6 to 12 in our test env and I'm seeing that queries
that involve a join between a partitioned table and other tables show
degraded performance compared to pg96.

My machine: 8 CPUs, 16 GB RAM, regular HD, Linux Red Hat 6
pg settings :
max_wal_size = 2GB
min_wal_size = 1GB
wal_buffers = 16MB
checkpoint_completion_target = 0.9
checkpoint_timeout = 30min
log_checkpoints = on
log_lock_waits = on
log_temp_files = 1024
log_min_duration_statement = 1
log_autovacuum_min_duration = 1
standard_conforming_strings = off
max_locks_per_transaction = 5000
max_connections = 500
log_line_prefix = '%t %d %p  '
random_page_cost = 4
deadlock_timeout = 5s
shared_preload_libraries = 'pg_stat_statements'
track_activity_query_size = 32764
log_directory = 'pg_log'
enable_partitionwise_join = on # for pg12
enable_partitionwise_aggregate = on  # for pg12
listen_addresses =  '*'
ssl =  on
maintenance_work_mem = 333MB
work_mem = 16MB
shared_buffers = 4020MB
effective_cache_size = 8040MB

PostgreSQL 12.2

I used this table as the joined table in both cases:
create table iot_device(id serial primary key, name text);
insert into iot_device(name) select generate_series(1,100)||'a';

In pg96 I created the following regular table:
create table iot_data(id serial primary key, data text, metadata bigint,
  device bigint references iot_device(id));

and inserted the data:
insert into iot_data select generate_series(1,1000), random()*10, random()*254, random()*99+1;

In pg12 I created a table with 3 hash partitions:
create table iot_data(id serial, data text, metadata bigint,
  device bigint references iot_device(id),
  primary key(id, device)) partition by hash(device);
create table iot_data_0 partition of iot_data for values with (modulus 3, remainder 0);
create table iot_data_1 partition of iot_data for values with (modulus 3, remainder 1);
create table iot_data_2 partition of iot_data for values with (modulus 3, remainder 2);


I generated a dump of the data on the pg96 machine and loaded it into the
pg12 db:
pg_dump -d postgres -U postgres -a -t iot_data > iot_data.dump
psql -h pg12_machine -d postgres -U postgres -f /tmp/iot_data.dump

postgres=# select count(*) from iot_data_0;
  count
---------
 3028682
(1 row)

postgres=# select count(*) from iot_data_1;
  count
---------
 3234335
(1 row)

postgres=# select count(*) from iot_data_2;
  count
---------
 3736983
(1 row)

create index on iot_data(metadata,lower(data));
vacuum analyze iot_data;

and now for the performance:
query :  explain analyze select * from iot_data da,iot_device de where
de.name in ('50a') and de.id = da.device and da.metadata=50 and
lower(da.data) like '50';

PG12:
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=5.16..773.61 rows=2 width=43) (actual time=2.858..2.858 rows=0 loops=1)
   ->  Seq Scan on iot_device de  (cost=0.00..2.25 rows=1 width=7) (actual time=0.014..0.020 rows=1 loops=1)
         Filter: (name = '50a'::text)
         Rows Removed by Filter: 99
   ->  Append  (cost=5.16..771.30 rows=6 width=36) (actual time=2.835..2.835 rows=0 loops=1)
         ->  Bitmap Heap Scan on iot_data_0 da  (cost=5.16..233.78 rows=2 width=36) (actual time=2.829..2.829 rows=0 loops=1)
               Recheck Cond: (metadata = 50)
               Filter: ((de.id = device) AND (lower(data) ~~ '50'::text))
               ->  Bitmap Index Scan on iot_data_0_metadata_lower_idx  (cost=0.00..5.14 rows=59 width=0) (actual time=2.827..2.827 rows=0 loops=1)
                     Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
         ->  Bitmap Heap Scan on iot_data_1 da_1  (cost=5.20..249.32 rows=2 width=37) (never executed)
               Recheck Cond: (metadata = 50)
               Filter: ((de.id = device) AND (lower(data) ~~ '50'::text))
               ->  Bitmap Index Scan on iot_data_1_metadata_lower_idx  (cost=0.00..5.18 rows=63 width=0) (never executed)
                     Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
         ->  Bitmap Heap Scan on iot_data_2 da_2  (cost=5.30..288.16 rows=2 width=36) (never executed)
               Recheck Cond: (metadata = 50)
               Filter: ((de.id = device) AND (lower(data) ~~ '50'::text))
               ->  Bitmap Index Scan on iot_data_2_metadata_lower_idx  (cost=0.00..5.29 rows=73 width=0) (never executed)
                     Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
 Planning Time: 8.157 ms
 Execution Time: 2.920 ms
(22 rows)


PG96:
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Nested Loop  (cost=6.57..397.19 rows=2 width=44) (actual time=0.121..0.121 rows=0 loops=1)

Re: pg12 partitions show bad performance vs pg96

2020-03-08 Thread Mariel Cherkassky
I realized that the planner does reach the right partition, since "(never
executed)" is shown next to the scans of the other partitions. However,
I'm still not sure why performance is better in pg96.

On Sun, 8 Mar 2020 at 18:05, Mariel Cherkassky <[email protected]> wrote:


Re: pg12 partitions show bad performance vs pg96

2020-03-08 Thread David Rowley
On Mon, 9 Mar 2020 at 05:05, Mariel Cherkassky wrote:
> PG12 :
>  Planning Time: 8.157 ms
>  Execution Time: 2.920 ms
> (22 rows)
>
>
> PG96 :
>  Planning time: 0.815 ms
>  Execution time: 0.158 ms
> (12 rows)

8 ms seems pretty slow for planning that query. Does the planning time
drop if you execute this multiple times in the same session? Does the
time change if you try again without any foreign keys?
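
One way to check both things in a single psql session (a sketch, not from the
thread; the constraint name iot_data_device_fkey is a guess based on the
default naming, check \d iot_data for the real one, and only drop it on a
throwaway copy of the data):

\timing on
-- run the same explain analyze from the original post a few times in this
-- session and compare the reported "Planning Time" values
explain analyze select * from iot_data da, iot_device de
 where de.name in ('50a') and de.id = da.device
   and da.metadata = 50 and lower(da.data) like '50';
-- then drop the foreign key and re-run it to see how much of the planning
-- time the constraint accounts for
alter table iot_data drop constraint iot_data_device_fkey;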

The planning time for the partitioned case is certainly going to take
longer. Partitioned tables are never going to improve the times of
query planning. It's only possible that they'll improve the
performance during query execution.

For such small, fast queries as the ones you've shown, it's important
to remember that more complex query plans (ones with more nodes) do
lead to longer times for executor startup and shutdown. EXPLAIN
(without ANALYZE) will perform query planning and executor
startup/shutdown. If you turn \timing on in psql and test the EXPLAIN
performance of these queries on each version, then you might get an
idea of where the overheads are.
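
A minimal psql sketch of that test (assuming the same query as in the
original post; run it on both versions and compare the time psql reports):

\timing on
-- plain EXPLAIN plans the query and runs executor startup/shutdown but
-- never executes it, so the time psql reports is mostly planning plus
-- per-node startup/shutdown overhead
explain select * from iot_data da, iot_device de
 where de.name in ('50a') and de.id = da.device
   and da.metadata = 50 and lower(da.data) like '50';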

Additionally, you're unlikely to see performance improvements with
table partitioning unless you're accessing many rows and partitioning
allows the data locality of the rows that you are accessing to
improve, i.e. it accesses fewer buffers and/or improves cache hit ratios.
In PG12, if the partition pruning can be done during query planning
then the planning and executor startup overhead is much lower since
there are fewer relations to generate access paths for and fewer nodes
in the final plan. This also improves the situation during execution
as it means fewer locks to take and fewer nodes to startup/shutdown.
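
For example (an illustration, not from the thread; the device value 7 is
arbitrary), a query that compares the partition key with a constant lets
the planner prune while planning:

explain select * from iot_data
 where device = 7 and metadata = 50 and lower(data) like '50';
-- with hash partitioning on (device), the constant allows plan-time
-- pruning, so only the one matching iot_data_N partition shows up in the
-- plan instead of an Append over all three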

> As you can see, the results are better in pg96. This example only shows the 
> results for a small data set. In bigger data sets I get a bigger diff...

Can you share the results of that?

> I tried changing many postgresql.conf parameters that were added
> (max_parallel_workers_per_gather, enable_partitionwise_join and so on...).

The former only does anything for parallel queries. None of the plans
you've shown are parallel ones.   The latter also does not count in
this case. It only counts when joining two identically partitioned
tables.
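
For context, a partition-wise join could only kick in here if iot_device
were itself hash-partitioned on id with the same modulus, something like
the sketch below (illustrative only, not from the thread, and probably not
worth doing for a 100-row lookup table):

create table iot_device(id serial, name text, primary key(id))
  partition by hash(id);
create table iot_device_0 partition of iot_device for values with (modulus 3, remainder 0);
create table iot_device_1 partition of iot_device for values with (modulus 3, remainder 1);
create table iot_device_2 partition of iot_device for values with (modulus 3, remainder 2);
-- only with both sides partitioned identically on the join key can
-- enable_partitionwise_join break the de.id = da.device join into three
-- smaller per-partition joins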

> I dont understand why in pg12 it scans all the partitions instead of the 
> relevant one..

If you'd specified a specific "device" in the query SQL, then the
query planner would know which partition to scan for that particular
device. However, since you're looking up the device in another table
and performing a join, the device is only known during query
execution. The plan nodes for the non-matching partitions do go
through executor startup, but they're not scanned during execution, as
you've seen with the "(never executed)" appearing in the EXPLAIN
ANALYZE output.  Since executor startup occurs before execution, the
device you mean is still unknown during executor startup, so the
executor must start up the nodes for all partitions that are in the
plan. Starting up a plan node is not free, but not really very
expensive either. However, the overhead of it might be quite large
proportionally in your case since the executor is doing so little
work.
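
As a concrete sketch of that point (the literal id 50 below is just a
placeholder, use whatever the lookup returns): resolving the device id
separately and passing it as a constant lets pruning happen at plan time.

-- step 1: look up the device id ('50a' should map to a single row)
select id from iot_device where name = '50a';
-- step 2: query with the literal id returned above; the planner can then
-- keep only one partition in the plan
explain analyze select * from iot_data
 where device = 50 and metadata = 50 and lower(data) like '50';

That trades an extra round trip for a much smaller plan, so it only pays
off when the per-query planning/startup overhead dominates, as it seems to
in this test.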

The most basic guideline for table partitioning is: don't partition
your tables unless it's a net win. If partitioning were always
faster, we'd just have designed Postgres to implicitly partition all
of your tables for you. There are some other guidelines in [1].

[1] 
https://www.postgresql.org/docs/12/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES

David




Re: pg12 partitions show bad performance vs pg96

2020-03-08 Thread Justin Pryzby
On Sun, Mar 08, 2020 at 06:05:26PM +0200, Mariel Cherkassky wrote:
> In pg12 I created  a table with 3 hash partitiones :
> create table iot_data(id serial ,data text,metadata bigint,device bigint
> references iot_device(id),primary key(id,device)) partition by hash(device);

> and now for the performance:
> query :  explain analyze select * from iot_data da,iot_device de where
> de.name in ('50a') and de.id = da.device and da.metadata=50 and
> lower(da.data) like '50';

> I dont understand why in pg12 it scans all the partitions instead of the
> relevant one..

As you noticed, it doesn't actually scan them.  I believe v11 "partition
elimination during query execution" is coming into play here.  There's no
option to disable that, but as a quick test, you could possibly try under PG10
(note, that doesn't support inherited indexes).  Or you could try to profile
under PG12 (and consider comparing with pg13dev).

You partitioned on hash(iot_data.device), but your query doesn't specify
device, except that da.device=de.id AND de.name IN ('50a').  If that's a typical
query, maybe it'd work better to partition on metadata or lower(name) (or
possibly both).
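
A hedged sketch of what partitioning on metadata could look like
(illustrative only; the range boundaries are arbitrary and not from the
thread):

create table iot_data(
    id serial,
    data text,
    metadata bigint,
    device bigint references iot_device(id),
    primary key(id, metadata)   -- the partition key must be part of the PK
) partition by range(metadata);
create table iot_data_low  partition of iot_data for values from (0)   to (100);
create table iot_data_mid  partition of iot_data for values from (100) to (200);
create table iot_data_high partition of iot_data for values from (200) to (255);

With metadata in the partition key, the da.metadata = 50 predicate can be
pruned at plan time; whether that beats the unpartitioned table still
depends on how much data ends up in each partition.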

On Sun, Mar 08, 2020 at 06:05:26PM +0200, Mariel Cherkassky wrote:
> PG12 :
>  Nested Loop  (cost=5.16..773.61 rows=2 width=43) (actual time=2.858..2.858
> rows=0 loops=1)
...
>  ->  Bitmap Heap Scan on iot_data_1 da_1  (cost=5.20..249.32 rows=2
> width=37) (NEVER EXECUTED)
...
>  Planning Time: 8.157 ms
>  Execution Time: 2.920 ms

> PG96 :
>  Nested Loop  (cost=6.57..397.19 rows=2 width=44) (actual time=0.121..0.121
> rows=0 loops=1)
...
>  Planning time: 0.815 ms
>  Execution time: 0.158 ms

-- 
Justin