>It is always a challenge for PostgreSQL to estimate such a filter,
>because information on the joint column distribution is absent.
>Could you explore this direction by building extended statistics on these
>clauses? It might move the plan in a more optimal direction.
Thanks a lot for your effort in analyzing this issue; we really appreciate
your suggestions! For now, we are focusing on exposing the issues that affect
performance. In the future, we may look into the direction you suggested.
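
For future reference, building those statistics could look something like the
sketch below. The column pair is purely an assumption for illustration; the
real candidates would be the correlated columns in the underestimated clauses:

    -- Multivariate statistics let the planner estimate clauses on
    -- correlated columns jointly instead of assuming independence.
    -- The column pair here is a hypothetical placeholder.
    CREATE STATISTICS customer_joint_stats (dependencies, mcv)
        ON c_birth_country, c_preferred_cust_flag
        FROM customer;

    -- The new statistics are populated by the next ANALYZE.
    ANALYZE customer;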
> Have you tried any tools to improve the cardinality estimates yet, like aqo [0]?
Yes, but a single run of this query takes nearly an hour, so I ran
"EXPLAIN ANALYZE" only once; performance seems slightly improved.
QUERY PLAN
Limit  (cost=293880.50..293880.50 rows=1 width=132) (actual time=2527921.078..2527921.233 rows=8 loops=1)
  CTE year_total
    ->  Gather  (cost=115049.92..233367.07 rows=384208 width=216) (actual time=1116.139..4005.105 rows=384208 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Parallel Append  (cost=114049.92..193946.27 rows=160087 width=216) (actual time=2430.791..2510.131 rows=128069 loops=3)
                ->  HashAggregate  (cost=190763.57..193145.83 rows=190581 width=216) (actual time=3977.521..4070.200 rows=190581 loops=1)
                      Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year
                      Worker 1:  Batches: 1  Memory Usage: 120857kB
                      ->  Hash Join  (cost=8151.60..103486.35 rows=2685453 width=174) (actual time=64.667..1605.601 rows=2685453 loops=1)
                            Hash Cond: (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
                            ->  Hash Join  (cost=5103.00..93216.88 rows=2750652 width=174) (actual time=48.111..1121.801 rows=2750652 loops=1)
                                  Hash Cond: (store_sales.ss_customer_sk = customer.c_customer_sk)
                                  ->  Seq Scan on store_sales  (cost=0.00..80552.52 rows=2880404 width=30) (actual time=0.068..230.529 rows=2880404 loops=1)
                                  ->  Hash  (cost=3853.00..3853.00 rows=10 width=152) (actual time=47.735..47.735 rows=10 loops=1)
                                        Buckets: 131072  Batches: 1  Memory Usage: 17161kB
                                        ->  Seq Scan on customer  (cost=0.00..3853.00 rows=10 width=152) (actual time=0.012..25.023 rows=10 loops=1)
                            ->  Hash  (cost=2135.49..2135.49 rows=73049 width=8) (actual time=16.242..16.242 rows=73049 loops=1)
                                  Buckets: 131072  Batches: 1  Memory Usage: 3878kB
                                  ->  Seq Scan on date_dim  (cost=0.00..2135.49 rows=73049 width=8) (actual time=0.074..8.744 rows=73049 loops=1)
                ->  HashAggregate  (cost=114049.92..115762.15 rows=136978 width=216) (actual time=2199.723..2268.851 rows=136978 loops=1)
                      Group Key: customer_1.c_customer_id, customer_1.c_first_name, customer_1.c_last_name, customer_1.c_preferred_cust_flag, customer_1.c_birth_country, customer_1.c_login, customer_1.c_email_address, date_dim_1.d_year
                      Worker 0:  Batches: 1  Memory Usage: 88089kB
                      ->  Hash Join  (cost=8151.60..67544.41 rows=1430939 width=177) (actual time=81.920..911.231 rows=1430939 loops=1)
                            Hash Cond: (catalog_sales.cs_sold_date_sk = date_dim_1.d_date_sk)
                            ->  Hash Join  (cost=5103.00..60729.97 rows=1434519 width=177) (actual time=53.469..638.140 rows=1434519 loops=1)
                                  Hash Cond: (catalog_sales.cs_bill_customer_sk = customer_1.c_customer_sk)
                                  ->  Seq Scan on catalog_sales  (cost=0.00..51842.75 rows=1441548 width=33) (actual time=0.066..134.023 rows=1441548 loops=1)
                                  ->  Hash  (cost=3853.00..3853.00 rows=10 width=152) (actual time=52.937..52.937 rows=10 loops=1)
                                        Buckets: 131072  Batches: 1  Memory Usage: 17161kB
                                        ->  Seq Scan on customer customer_1  (cost=0.00..3853.00 rows=10 width=152) (actual time=0.019..27.549 rows=10 loops=1)
                            ->  Hash  (cost=2135.49..2135.49 rows=73049 width=8) (actual time=27.968..27.968 rows=73049 loops=1)
                                  Buckets: 131072  Batches: 1  Memory Usage: 3878kB
                                  ->  Seq Scan on date_dim date_dim_1  (cost=0.00..2135.49 rows=73049 width=8) (actual time=0.099..14.115 rows=73049 loops=1)
                ->  HashAggregate  (cost=61268.33..61976.44 rows=56649 width=