Useless memoize path generated for unique join on primary keys

2022-05-03 Thread Benjamin Coutu
Hello,

I have come across a plan that should never get generated IMHO:

SELECT 1
FROM extdataregular e1
INNER JOIN extdataempty e2 ON e1.field = e2.field AND e1.index = e2.index

generates the following plan:

Nested Loop  (cost=1.13..528540.89 rows=607604 width=4) (actual time=9298.504..9298.506 rows=0 loops=1)
  ->  Index Only Scan using pk_extdataempty on extdataempty e2  (cost=0.56..157969.52 rows=4078988 width=16) (actual time=0.026..641.248 rows=4067215 loops=1)
        Heap Fetches: 268828
  ->  Memoize  (cost=0.58..0.67 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=4067215)
        Cache Key: e2.field, e2.index
        Cache Mode: logical
        Hits: 0  Misses: 4067215  Evictions: 3228355  Overflows: 0  Memory Usage: 65537kB
        Buffers: shared hit=16268863
        ->  Index Only Scan using pk_extdataregular on extdataregular e1  (cost=0.57..0.66 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=4067215)
              Index Cond: ((field = e2.field) AND (index = e2.index))
              Heap Fetches: 2

Please note that the memoize node has no cache hits, which is not surprising 
given that we are joining on two primary keys that are unique by definition 
("field" and "index" make up the primary key of both tables).
Why would it ever make sense to generate a memoize plan for a unique join?
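A toy simulation (plain Python, not planner code) illustrates what happens to a Memoize-style cache when the cache key is unique per outer row: every probe is a miss, so the node is pure overhead.

```python
# Toy Memoize-style cache keyed on the outer side's join values.
# With unique keys, every lookup misses and nothing is ever reused.
def probe_all(outer_keys):
    cache, hits, misses = {}, 0, 0
    for key in outer_keys:
        if key in cache:
            hits += 1
        else:
            misses += 1
            cache[key] = object()  # stand-in for the cached inner rows
    return hits, misses

# Unique (field, index) pairs, as with a primary-key join:
hits, misses = probe_all([(i, i) for i in range(1000)])
print(hits, misses)  # 0 1000
```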

I think this issue might tie in with the current discussion over on the hackers 
mailing list [1]

Cheers, Ben

[1] 
https://www.postgresql.org/message-id/flat/CAApHDvpFsSJAThNLtqaWvA7axQd-VOFct%3DFYQN5muJV-sYtXjw%40mail.gmail.com

-- 

Benjamin Coutu
[email protected]

ZeyOS GmbH & Co. KG
http://www.zeyos.com




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread Benjamin Coutu
> I'd say it's a pretty different problem. The cache hit ratio
> discussion on that thread talks about underestimating the hit ratio.
> That particular problem could only lead to Memoize plans *not* being
> chosen when they maybe should be. Not the other way around, which is
> your case.
> 
> create statistics extdataregular_field_index_stats (ndistinct) on
> field, index from extdataregular;
> analyze extdataregular;
> 
> would likely put that right.

Thanks David, using extended statistics for both (and only for both) tables 
solved this problem.
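For the archives, spelled out for both tables (the statistics object name for extdataempty is my own by analogy; any name works):

```sql
CREATE STATISTICS extdataregular_field_index_stats (ndistinct)
  ON field, index FROM extdataregular;
CREATE STATISTICS extdataempty_field_index_stats (ndistinct)
  ON field, index FROM extdataempty;  -- name chosen by analogy
ANALYZE extdataregular;
ANALYZE extdataempty;
```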

BTW, thank you for all your work on performance in recent releases.




Unclamped row estimates with OR-ed subplans

2020-06-19 Thread Benjamin Coutu
Hello,

please consider the following SQL query:

SELECT * FROM "transactions" WHERE 
"account" IN (SELECT "ID" FROM "accounts" WHERE "name" ~~* '%test%') OR
"contract" IN (SELECT "ID" FROM "contracts" WHERE "name" ~~* '%test%')

This yields the following plan on Postgres 11:

Seq Scan on transactions  (cost=67.21..171458.03 rows=1301316 width=1206)
  Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
  SubPlan 1
    ->  Bitmap Heap Scan on accounts  (cost=33.36..61.16 rows=46 width=4)
          Recheck Cond: ((name)::text ~~* '%test%'::text)
          ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.35 rows=46 width=0)
                Index Cond: ((name)::text ~~* '%test%'::text)
  SubPlan 2
    ->  Seq Scan on contracts  (cost=0.00..5.93 rows=5 width=4)
          Filter: ((name)::text ~~* '%test%'::text)

So the WHERE clause of this query is just two subplans OR-ed together: one is
estimated to yield 46 rows and one is estimated to yield 5 rows.
I'd expect the total rows for the seqscan to be estimated at 46 then, following 
the logic that rows_seqscan = max(rows_subplan1, rows_subplan2). As you can 
see, the optimizer estimates a whopping 1301316 rows instead.

I am absolutely aware that those are hashed subplans below a seqscan and that
Postgres therefore has to scan all tuples of the table. But the problem is that
upper nodes (which are excluded from this example for simplicity) think they
will receive 1301316 rows from the seqscan, when in fact they will probably
only see a handful, which the planner could have (easily?) deduced by taking
the greater of the two subplan row estimates.
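A sketch of the clamping rule proposed above (hypothetical, not planner code): under the assumption that the outer join columns are near-unique, the OR of the two hashed-subplan filters should pass at most roughly as many rows as the larger subplan produces.

```python
# Hypothetical estimator sketch: clamp the seqscan's output estimate to
# the larger of the subplan row estimates (valid only if the outer join
# columns are near-unique, as argued above).
def clamped_or_estimate(scan_rows, subplan_rows):
    return min(scan_rows, max(subplan_rows))

print(clamped_or_estimate(1301316, [46, 5]))  # 46 rather than 1301316
```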

What am I missing, or is this perhaps a shortfall of the planner?

Thanks,

Ben

-- 

Benjamin Coutu
[email protected]

ZeyOS GmbH & Co. KG
http://www.zeyos.com




Re: Unclamped row estimates with OR-ed subplans

2020-06-19 Thread Benjamin Coutu
> No.  The subplan estimates are for the number of rows produced by one
> execution of the subplan, ie the numbers of "accounts" or "contracts"
> rows that match those inner WHERE conditions.  This has very little
> a-priori relationship to the number of "transactions" rows that will
> satisfy the outer WHERE condition.  If we knew that transactions.account
> and transactions.contract were unique columns, then we might be able
> to say that there shouldn't be more than one outer match per subplan
> result row ... but you didn't say that, and it seems unlikely.

Thanks Tom, I understand your point.

I don't want to waste your time but maybe there is room for improvement as both 
"account" and "contract" are highly distinct and the individual subplan 
estimates are quite accurate:

Seq Scan on transactions  (cost=67.81..171458.63 rows=1301316 width=1206) (actual time=69.418..917.594 rows=112 loops=1)
  Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
  Rows Removed by Filter: 1792937
  SubPlan 1
    ->  Bitmap Heap Scan on accounts  (cost=33.96..61.76 rows=46 width=4) (actual time=3.053..3.292 rows=111 loops=1)
          Recheck Cond: ((name)::text ~~* '%test%'::text)
          Rows Removed by Index Recheck: 4
          Heap Blocks: exact=104
          ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.95 rows=46 width=0) (actual time=0.505..0.505 rows=118 loops=1)
                Index Cond: ((name)::text ~~* '%test%'::text)
  SubPlan 2
    ->  Seq Scan on contracts  (cost=0.00..5.93 rows=5 width=4) (actual time=2.531..2.836 rows=4 loops=1)
          Filter: ((name)::text ~~* '%test%'::text)
          Rows Removed by Filter: 272

For comparison here are the plans for the queries with the individual where 
clauses:

SELECT * FROM "transactions" WHERE "account" IN (SELECT "ID" FROM "accounts" 
WHERE "name" ~~* '%test%')

Nested Loop  (cost=34.38..488.93 rows=155 width=1206) (actual time=0.599..1.393 rows=112 loops=1)
  ->  Bitmap Heap Scan on accounts  (cost=33.96..61.76 rows=46 width=4) (actual time=0.541..0.796 rows=111 loops=1)
        Recheck Cond: ((name)::text ~~* '%test%'::text)
        Rows Removed by Index Recheck: 4
        Heap Blocks: exact=104
        ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.95 rows=46 width=0) (actual time=0.521..0.521 rows=118 loops=1)
              Index Cond: ((name)::text ~~* '%test%'::text)
  ->  Index Scan using fk_transactions_account on transactions  (cost=0.43..9.08 rows=21 width=1206) (actual time=0.004..0.005 rows=1 loops=111)
        Index Cond: (account = accounts."ID")

SELECT * FROM "transactions" WHERE "contract" IN (SELECT "ID" FROM "contracts" 
WHERE "name" ~~* '%test%')

Nested Loop  (cost=3.76..10.10 rows=31662 width=1206) (actual time=0.082..0.082 rows=0 loops=1)
  ->  Bitmap Heap Scan on contracts  (cost=3.64..5.74 rows=5 width=4) (actual time=0.069..0.075 rows=4 loops=1)
        Recheck Cond: ((name)::text ~~* '%test%'::text)
        Heap Blocks: exact=2
        ->  Bitmap Index Scan on s_contracts  (cost=0.00..3.64 rows=5 width=0) (actual time=0.060..0.060 rows=4 loops=1)
              Index Cond: ((name)::text ~~* '%test%'::text)
  ->  Index Scan using fk_transactions_contract on transactions  (cost=0.12..0.86 rows=1 width=1206) (actual time=0.001..0.001 rows=0 loops=4)
        Index Cond: (contract = contracts."ID")

The statistics for the columns are:

SELECT attname, null_frac, n_distinct FROM pg_stats
WHERE tablename = 'transactions' AND attname IN ('account', 'contract')

transactions.account:   null_frac=0.025  n_distinct=80277
transactions.contract:  null_frac=1      n_distinct=0  (there are basically no non-null values for field "contract" in transactions)

According to pg_class.reltuples the table "transactions" has 1735088 rows.

I'd naively expect the selectivity for an OR of those two hashed subplans given 
uniform distribution to be:

rows_total =
    ((rows_transactions * (1 - null_frac_account)) / n_distinct_account) * expected_rows_from_subplan1 +
    ((rows_transactions * (1 - null_frac_contract)) / n_distinct_contract) * expected_rows_from_subplan2

=> rows_total =
    ((1735088 * (1 - 0.025)) / 80277) * 46 +
    ((1735088 * (1 - 1)) / 0) * 5

=> rows_total = 969 + 0  /* no non-null values for contract field */

Please forgive the sloppy math, but something along these lines could be promising.
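Plugging the numbers in (a quick back-of-the-envelope check of the formula above; the contract term is dropped since all its values are NULL):

```python
# Back-of-the-envelope version of the proposed estimate.
rows_transactions = 1735088
null_frac_account = 0.025
n_distinct_account = 80277
expected_rows_subplan1 = 46  # planner's estimate for the accounts subplan

# Rows matching one account value, times the number of account values
# produced by SubPlan 1. The contract term is zero (null_frac = 1).
rows_total = (rows_transactions * (1 - null_frac_account)
              / n_distinct_account) * expected_rows_subplan1

print(round(rows_total))  # 969
```

That is within an order of magnitude of the 112 rows actually returned, versus the planner's 1301316.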

Btw, I don't quite understand why the nested loop on contract only is expected 
to yield 31662 rows, when the null_frac of field transactions.contract is 1. 
Shouldn't that indicate zero rows or some kind of default minimum estimate for 
that query?

Thanks again!

Benjamin Coutu




Re: Unclamped row estimates with OR-ed subplans

2020-06-19 Thread Benjamin Coutu
> While you're waiting, you might think about recasting the query to
> avoid the OR.  Perhaps you could do a UNION of two scans of the
> transactions table?

Thanks for the hint, I am well aware of the workaround for OR via UNION. I am 
not trying to improve this query per se as it is the small root of a very 
complex generated query and it's unfeasible to rewrite it to a UNION in this 
case.
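For anyone reading along, the UNION shape Tom refers to would look roughly like this (a sketch; UNION deduplicates rows that match both branches):

```sql
SELECT * FROM "transactions"
WHERE "account" IN (SELECT "ID" FROM "accounts" WHERE "name" ~~* '%test%')
UNION
SELECT * FROM "transactions"
WHERE "contract" IN (SELECT "ID" FROM "contracts" WHERE "name" ~~* '%test%')
```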

The point of my message to the list was to highlight the misestimation, which I 
couldn't wrap my head around. Maybe this discussion can give some food for 
thought to someone who might tackle this in the future. It would surely be 
great to have selectivity estimate smarts for the generic case of OR-ed 
SubPlans someday.

> 
> > Btw, I don't quite understand why the nested loop on contract only is 
> > expected to yield 31662 rows, when the null_frac of field 
> > transactions.contract is 1. Shouldn't that indicate zero rows or some kind 
> > of default minimum estimate for that query?
> 
> That I don't understand.  I get a minimal rowcount estimate for an
> all-nulls outer table, as long as I'm using just one IN rather than

Yeah, that might be worth digging into. Is there any other info apart from 
those stats that I could provide?