RE: Postgres Optimizer ignores information about foreign key relationship, severely misestimating number of returned rows in join

2020-10-27 Thread Ehrenreich, Sigrid
Hi Tom,

A patch would be very much appreciated.
We are currently running on Version 12, but could upgrade to 13, if necessary.

Could you send me a notification if you manage to write a patch for that?

Regards,
Sigrid

-Original Message-
From: Tom Lane  
Sent: Monday, October 26, 2020 11:54 PM
To: David Rowley 
Cc: Ehrenreich, Sigrid ; 
[email protected]
Subject: Re: Postgres Optimizer ignores information about foreign key 
relationship, severely misestimating number of returned rows in join

David Rowley  writes:
> On Tue, 27 Oct 2020 at 06:54, Ehrenreich, Sigrid  
> wrote:
>> ->  Hash Join  (cost=226.27..423.82 rows=115 width=0) (actual 
>> time=3.150..7.511 rows=3344 loops=1)   <=== With the FK, the 
>> estimation should be 3344, but it is 115 rows

> I'd have expected this to find the foreign key and have the join
> selectivity of 1.0, but I see it does not, because one of the
> EquivalenceClasses has a constant due to the fact.low_card = 1 qual.

Right.

> I feel like we could probably do better there and perhaps somehow
> count ECs with ec_has_const as matched, but there seem to be some
> assumptions later in get_foreign_key_join_selectivity() where we
> determine the selectivity based on the base rel's tuple count.  We'd
> need to account for how many rows remain after filtering the ECs
> with ec_has_const == true, else we'd be doing the wrong thing.  That
> needs more thought than I have time for right now.

Yeah, I'm fooling with a patch for that now.  The basic problem is
that the selectivity of the x = constant clauses has already been
factored into the sizes of both join input relations, so we're
double-counting it if we just apply the existing FK-based
selectivity estimate.  I think though that we can recover the
selectivity associated with that qual on the FK side (or should
it be the PK side?) and cancel it out of the FK selectivity.

regards, tom lane
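
For reference, a minimal sketch of the scenario discussed in this
thread; the schema and names below are illustrative, not the original
reporter's:

    CREATE TABLE dim (
        id       int,
        low_card int,
        PRIMARY KEY (id, low_card)
    );
    CREATE TABLE fact (
        id       int,
        low_card int,
        FOREIGN KEY (id, low_card) REFERENCES dim (id, low_card)
    );

    -- The join matches the foreign key, but fact.low_card = 1 puts a
    -- constant into one of the EquivalenceClasses, so the planner falls
    -- back to per-column selectivities and underestimates the join size:
    EXPLAIN
    SELECT count(*)
    FROM fact
    JOIN dim USING (id, low_card)
    WHERE fact.low_card = 1;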




Re: Query Performance / Planner estimate off

2020-10-27 Thread Mats Olsen


On 10/21/20 5:35 PM, Sebastian Dressler wrote:

Hi Mats,

Happy to help.

On 21. Oct 2020, at 16:42, Mats Olsen wrote:

On 10/21/20 2:38 PM, Sebastian Dressler wrote:

Hi Mats,

On 20. Oct 2020, at 11:37, Mats Julian Olsen <[email protected]> wrote:


[...]

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR 

2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK 

3) enable_nestloop=off; enable_seqscan=off (2 min): 
https://explain.depesz.com/s/0WXx 


How can I get Postgres not to loop over 12M rows?


I looked at the plans and your config, and I have some thoughts:


- The row estimate is off, as you possibly noticed. This can possibly 
be solved by raising `default_statistics_target` to e.g. 2500 
(we typically use that) and running ANALYZE

I've `set default_statistics_target=2500` and ran ANALYZE on both 
tables involved, but unfortunately the plan is the same. The columns we 
use for joining here are hashes and we expect very few duplicates in 
the tables. Hence I think extended statistics (storing most common 
values and histogram bounds) aren't useful for this kind of data. 
Would you say the same thing?


Yes, that looks like a given in this case.
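
A sketch of how the statistics-target change discussed above is 
typically applied; the per-column form and the table/column names, 
taken from later in the thread, are assumptions on my part:

    SET default_statistics_target = 2500;   -- session-wide, as tried above
    ANALYZE "Pair_evt_Mint";
    -- or per column:
    ALTER TABLE "Pair_evt_Mint" ALTER COLUMN evt_tx_hash SET STATISTICS 2500;
    ANALYZE "Pair_evt_Mint";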



- I however think that the misestimate might be caused by evt_tx_hash 
being of type bytea. I believe that PG cannot estimate this very well 
for JOINs and tends to pick row estimates that are too low. Hence the 
nested loop is picked and there might be no way around this. I have 
experienced similar things when applying JOINs on VARCHAR with e.g. 
more than 3 fields for comparison.


This is very interesting, and I have never heard of issues with using 
`bytea` for joins. Our entire database is filled with them, as we 
deal with hashes of different lengths; in fact I would estimate that 
60% of the columns are byteas. My intuition says that it's better 
to store the hashes as byte arrays rather than `text` fields, as you 
can compare the raw bytes directly without encoding first. Do you 
have any references for this?


Unfortunately, I have not dealt much with `bytea` yet. It just 
rang a bell when I saw this kind of off-estimate in combination with 
nested loops. In the case I referenced, the tables had 3 VARCHAR 
columns to be joined on and the estimate was very much off. As 
a result, PG chose nested loops in the upper layers of processing. Due 
to another JOIN, the estimate went down to 1 row whereas it was 1 
million rows in reality. Now, yours is "only" off by a factor of 5, 
i.e. this might be a totally different reason.


However, I looked into the plan once more and realized that the 
source of the problem could also be the scan on "Pair_evt_Mint" along 
the date dimension, although you have a stats target of 10k there. If 
the timestamp is (roughly) sorted, you could try adding a BRIN index 
and by that maybe get a better estimate & scan time.

Hi again, after around 48 hours the CREATE INDEX CONCURRENTLY completed 
successfully. The new plan still uses a nested loop, but the scan on 
"Pair_evt_Mint" is now a Parallel Index Scan. See 
https://explain.depesz.com/s/8ZzT
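
A sketch of the kind of BRIN index suggested above; the column name 
evt_block_time is an assumption, since the thread does not name the 
timestamp column:

    CREATE INDEX CONCURRENTLY pair_evt_mint_block_time_brin
        ON "Pair_evt_Mint" USING brin (evt_block_time);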


Alternatively, since I know the length of the hashes in advance, I 
could have used `varchar(n)`, but I don't think there are any gains to 
be had in Postgres by doing that? Something like `bytea(n)` would also 
have been interesting, had Postgres been able to exploit that 
information.


I think giving VARCHAR a shot makes sense, maybe on an experimental 
basis, to see whether the estimates get better. Maybe PG can then 
estimate that there are (almost) no dupes within the table, but that 
there are N-many across tables. Another option to explore might be to 
use UUID as a type. That said, as mentioned above, it looks more like 
the timestamp is causing the misestimate.
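
One possible way to run that experiment, assuming the join key is the 
evt_tx_hash bytea column; encode() produces a hex text representation 
that could be joined on instead:

    CREATE TABLE mint_text AS
    SELECT encode(t.evt_tx_hash, 'hex') AS evt_tx_hash_text, t.*
    FROM "Pair_evt_Mint" t;
    ANALYZE mint_text;
    -- then re-run the join against evt_tx_hash_text and compare the
    -- row estimates in EXPLAIN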


Maybe try querying this table by itself with that timestamp to see 
what kind of estimate you get?
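
Something like the following, with an assumed column name and date 
range, would show the estimate for the timestamp filter in isolation:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*)
    FROM "Pair_evt_Mint"
    WHERE evt_block_time >= '2020-09-01'
      AND evt_block_time <  '2020-10-01';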



- Other things to look into:

    - work_mem seems too low to me at 56MB; consider raising it to the 
GB range to avoid disk-based operations

    - min_parallel_table_scan_size - try 0
    - parallel_setup_cost (default 1000, maybe try 500)
    - parallel_tuple_cost (default 1.0, maybe try 0.1)
    - random_page_cost (as mentioned, consider raising this much 
higher, e.g. by a factor of 10), or (typically) seq_page_cost can be 
much lower (0.1, 0.01) depending on your storage; a sketch of these 
settings follows after this list
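
A session-level sketch of the settings above, using the values 
suggested in the thread (starting points to experiment with, not 
recommendations):

    SET work_mem = '1GB';
    SET min_parallel_table_scan_size = 0;
    SET parallel_setup_cost = 500;
    SET parallel_tuple_cost = 0.1;
    SET seq_page_cost = 0.1;   -- or adjust random_page_cost instead
    -- then re-run EXPLAIN (ANALYZE) on the query and compare plans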


I've tried various settings of these parameters now, and 
unfortunately the only parameter that alters the query plan is the 
last one (random_page_cost), which also has the side effect of 
(almost) forcing sequential scans for most queries as far as I 
understand? Our storage is Google Cloud pd-ssd.


I think a combination of random_page_cost with parallel_tuple_cost and 
min_parallel_table_scan_size might make sense. By that you possibly 
get at leas