FDW join vs full join push down

2021-09-07 Thread Marc Olivé
Hello all,

We have some PG servers which we merge into a "coordinator" node using FDW
and partitioned tables, we partition them by a synthetic "shard_id" field.
There are around 30 tables coordinated this way, with all foreign servers
having the same schema structure.

We have some performance issues when joining foreign tables, always done by
the same "shard_id", where the major bottleneck is how rows from joined
tables are fetched. explain(verbose) shows:

Remote SQL: SELECT entity_id, execution_id, shard_id FROM entity_execution
WHERE ((shard_id = 5)) AND (($1::bigint = entity_id))

This way, PG is doing a lot of round trips between the coordinator and the
foreign nodes, fetching a single row every time, and we have a very high
latency between the coordinator and the nodes.

As the joins are done on the same node, it could send the whole query and
fetch all results in a single round trip.

The FDW are configured with 'use_remote_estimate' to true and we have the
parameters enable_partition_pruning, enable_partitionwise_aggregate and
enable_partitionwise_join activated.
The tables involved can have from a million rows to more than 1000
millions, but the queries usually return a few thousand rows.

A full sample plan and it's query: https://explain.depesz.com/s/TbJy
explain(verbose)
select *
from nlp.note_entity_label nel
join nlp.note_entity ne on ne.note_entity_id = nel.note_entity_id and
ne.shard_id = nel.shard_id
join nlp.note_entity_execution nex on nex.note_entity_id =
ne.note_entity_id and nex.shard_id = nel.shard_id
where
nel.label_id = 192
and nel.shard_id = 5

The row estimates are quite off the true ones, even though we have run
'analyze' on the remote nodes before, and 'use_remote_estimate' is on.
The above query ends in about 6 minutes.

The interesting part is that if we change the 'join' by 'full joins', with
some extra filter, the plan is the one we believe is the optimal one, and
indeed the query ends in 1 second: https://explain.depesz.com/s/b3As

explain(verbose)
with ents as(
select nel.note_entity_id nelid, ne.note_entity_id neid,
nex.note_entity_id nexid, *
from nlp.note_entity_label nel
full join nlp.note_entity ne on ne.note_entity_id = nel.note_entity_id
and ne.shard_id = nel.shard_id
full join nlp.note_entity_execution nex on nex.note_entity_id =
ne.note_entity_id and nex.shard_id = nel.shard_id
where
nel.label_id = 192
and nel.shard_id = 5
)
select *
from ents
where nelid is not null
and neid is not null
and nexid is not null
;

Here we can see that the whole query is sent to the fdw and it finishes in
a reasonable time.

So, the question is if we can do something to make the fdw send the whole
query to the remote nodes when the involved joins use the same partition,
or why isn't PG sending it when we use 'inner join'.
We have tried tweaking the "fdw_tuple_cost" , increasing and lowering it to
unreasonable values
10, 1000, 10 and 100 without the desired result.

Thanks,


Foreign table as partition - Non optimal aggregation plan

2021-09-07 Thread Stepan Yankevych
Hi All!

We are using such feature as Foreign table as partition in PG 13 under CentOS
Here is our table
CREATE TABLE dwh.l1_snapshot (
l1_snapshot_id int8 NOT NULL DEFAULT 
nextval('sq_l1_snapshot_id'::regclass),
start_date_id int4 NULL,
...
...
...
dataset_id int4 NULL, -- ETL needs
transaction_time timestamp NULL
)
PARTITION BY RANGE (start_date_id);


We have several partitions locally and one partition for storing historical 
data as foreign table which is stored on another PG13
When I run following query . Partition pruning redirect query to that foreign 
table
select count(1) from dwh.l1_snapshot ls where start_date_id  = 20201109;
I see remote SQL as following

SELECT NULL FROM dwh.l1_snapshot_tail2 WHERE ((start_date_id = 20201109)).
It transfers vie network hundred million records in our case

When I query directly partition (almost the same what partition pruning does) I 
see another remote sql

select count(1) from partitions.l1_snapshot_tail2 ls where start_date_id  = 
20201109;

And remote sql is
SELECT count(1) FROM dwh.l1_snapshot_tail2 WHERE ((start_date_id = 20201109));

So in case querying foreign table we see aggregation is propagated to remote 
host (Like driving_site in oracle)
But in the first case with partition pruning the aggregation is not propagated 
to remote host.
And of course different performance 22 sec vs 75sec


That would great to have the same behavior in both cases (pushing aggregation 
to remote side).
It should be possible at least for simple aggregation (without distinct etc)


Thanks!
Stepan Yankevych

Office: +380 322 424 642xx58840  Cell: 
+380 96 915 9551  Email: 
[email protected]
Lviv,  Ukraine  epam.com


CONFIDENTIALITY CAUTION AND DISCLAIMER
This message is intended only for the use of the individual(s) or entity(ies) 
to which it is addressed and contains information that is legally privileged 
and confidential. If you are not the intended recipient, or the person 
responsible for delivering the message to the intended recipient, you are 
hereby notified that any dissemination, distribution or copying of this 
communication is strictly prohibited. All unintended recipients are obliged to 
delete this message and destroy any printed copies.