Re: Postgres Optimizer ignores information about foreign key relationship, severely misestimating number of returned rows in join

2020-10-28 Thread Tom Lane
"Ehrenreich, Sigrid"  writes:
> A patch would be very much appreciated.
> We are currently running on Version 12, but could upgrade to 13, if necessary.
> Could you send me a notification if you managed to program a patch for that?

I've pushed a patch for this to HEAD, but current thinking is that we
will not be back-patching it.  Still, if you're desperate you could
consider running a custom build of v13 with that patch --- a quick
check suggests that it would back-patch easily.  v12 would be a
slightly harder lift (I see one hunk doesn't apply) but probably
not by much.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=ad1c36b0709e47cdb3cc4abd6c939fe64279b63f
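
For readers wanting to picture the scenario, a minimal sketch (hypothetical
tables, not the reporter's schema) of a composite foreign-key join whose row
count the planner could badly misestimate when it ignores the FK:

    create table parent (a int, b int, primary key (a, b));
    create table child  (a int, b int, foreign key (a, b) references parent);
    -- The FK guarantees each child row matches exactly one parent row, so the
    -- join below should return roughly count(child) rows; an estimator that
    -- ignores the FK and multiplies per-clause selectivities comes out far lower.
    explain select * from child c join parent p on c.a = p.a and c.b = p.b;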

regards, tom lane




query plan using partial index expects a much larger number of rows than is possible

2020-10-28 Thread Olivier Poquet
Hello,
I have a large table (about 13 million rows) full of customer order 
information. Since most of that information is for orders that have already 
been fulfilled, I have a partial index to help quickly zero in on rows that 
have not been fulfilled. This works well, but I noticed today when joining with 
another large table using its primary key that even though the planner was 
using my partial index, it decided to do a merge join to the second large table 
instead of the nested loop I would have expected.
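
(For context, a partial index like the one shown below would presumably have
been created along these lines -- a sketch only, since the original definition
isn't quoted:

    create index orderitems_committed_unfulfilled on orderitems (id)
        where LEAST(committed, quantity) > fulfilled;
)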

Looking at it in more detail, I found that the planner is assuming that I'll 
get millions of rows back even when I do a simple query that does an index scan 
on my partial index:

=> \d orderitems_committed_unfulfilled
Index "public.orderitems_committed_unfulfilled"
 Column |  Type  | Key? | Definition
--------+--------+------+------------
 id     | bigint | yes  | id
btree, for table "public.orderitems", predicate (LEAST(committed, quantity) > fulfilled)

=> explain (analyze, buffers) select oi.id from orderitems oi where LEAST(oi.committed, oi.quantity) > oi.fulfilled;

                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using orderitems_committed_unfulfilled on orderitems oi  (cost=0.41..31688.23 rows=2861527 width=8) (actual time=0.039..2.092 rows=2274 loops=1)
   Heap Fetches: 2493
   Buffers: shared hit=1883
 Planning Time: 0.110 ms
 Execution Time: 2.255 ms
(5 rows)

So nice and quick, but the planner thought it would get back 2861527 rows 
instead of the 2274 I actually have.  That explains why it thought it would 
make sense to do a merge join with my other large table instead of the nested 
loop over the 2k rows.  I would have expected the planner to know that there's 
no way it'll get back 2 million rows though, given that:

=> select relname, relpages, reltuples from pg_class where relname = 'orderitems_committed_unfulfilled';
             relname              | relpages | reltuples
----------------------------------+----------+-----------
 orderitems_committed_unfulfilled |     3051 |      2112
(1 row)

It knows there's only 2k-ish of them from the index. The 2 mil number is the 
same as what the planner expects if I disable using indexes and it does a seq 
scan, so I'm assuming it's just the guess from the column statistics and the 
planner is not using the size of the partial index.
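
(To reproduce that comparison, a sketch using the session-local planner
toggles -- these affect only the current session:

    set enable_indexscan = off;
    set enable_indexonlyscan = off;
    set enable_bitmapscan = off;
    explain select oi.id from orderitems oi
        where LEAST(oi.committed, oi.quantity) > oi.fulfilled;
    reset all;  -- restore the default planner settings
)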

I'm running:
=> select version();
                                          version
--------------------------------------------------------------------------------------------
 PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc, a 97d579287 p 0be2109a97, 64-bit
(1 row)

I'm wondering if the behavior that I'm seeing is expected in 12.4, and if so, 
whether it changes in a later version or whether I should file an enhancement 
request. Or, if it's not expected, is there something I'm doing wrong, or 
should I file a bug?

Thanks for your time.

-- 
  Olivier Poquet
  [email protected]




Re: query plan using partial index expects a much larger number of rows than is possible

2020-10-28 Thread Tom Lane
"Olivier Poquet"  writes:
> Looking at it in more detail, I found that the planner is assuming that I'll 
> get millions of rows back even when I do a simple query that does an index 
> scan on my partial index:

We don't look at partial-index predicates when trying to estimate the
selectivity of a WHERE clause.  It's not clear to me whether that'd be
a useful thing to do, or whether it could be shoehorned into the system
easily.  (One big problem is that while the index size could provide
an upper bound, it's not apparent how to combine that knowledge with
selectivities of unrelated conditions.  Also, it's riskier to extrapolate
a current rowcount estimate from stale relpages/reltuples data for an
index than it is for a table, because the index is less likely to scale
up linearly.)

If this particular query is performance-critical, you might consider
materializing the condition, that is something like

create table orderitems (
   ... ,
   committed_unfulfilled bool GENERATED ALWAYS AS
 (LEAST(committed, quantity) > fulfilled) STORED
);

and then your queries and your partial-index predicate must look
like "WHERE committed_unfulfilled".  Having done this, ANALYZE
would gather stats on the values of that column and the WHERE
clauses would be estimated accurately.
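
A sketch of what the rebuilt index and query would look like under that
scheme (assuming the generated column above; the old predicate-based index
would be dropped and recreated over the new column):

    create index orderitems_committed_unfulfilled on orderitems (id)
        where committed_unfulfilled;
    analyze orderitems;  -- gathers stats on the new column
    explain select oi.id from orderitems oi where oi.committed_unfulfilled;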

regards, tom lane




Re: query plan using partial index expects a much larger number of rows than is possible

2020-10-28 Thread Olivier Poquet
Thanks Tom,
That makes perfect sense.  

I'd already gone the route of materializing the condition, but I didn't even 
realize that generated columns were an option (I'd done the same with triggers 
instead).  So thanks a lot for that too!
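
For completeness, a sketch of the trigger-based equivalent alluded to above
(hypothetical names; this is the usual workaround on versions without stored
generated columns):

    alter table orderitems add column committed_unfulfilled boolean;

    create function orderitems_sync_committed_unfulfilled() returns trigger as $$
    begin
        -- keep the flag in sync with the three source columns
        new.committed_unfulfilled := LEAST(new.committed, new.quantity) > new.fulfilled;
        return new;
    end;
    $$ language plpgsql;

    create trigger orderitems_committed_unfulfilled_trg
        before insert or update on orderitems
        for each row execute procedure orderitems_sync_committed_unfulfilled();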

-- 
  Olivier Poquet
  [email protected]





Re: Understanding bad estimate (related to FKs?)

2020-10-28 Thread Justin Pryzby
On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:
> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk 
>  wrote:
> 
> > >> The item I'm focused on is node 23. The estimate is for 7 rows, actual
> > is 896 (multiplied by 1062 loops). I'm confused about two things in this
> > node.
> > >>
> > >> The first is Postgres' estimate. The condition for this index scan
> > contains three expressions --
> > >>
> > >> (five_uniform = zulu_five.five_uniform) AND
> > >> (whiskey_mike = juliet_india.whiskey_mike) AND
> > >> (bravo = 'mike'::text)
> >
> 
> Are the columns correlated?

I guess it shouldn't matter, since the FKs should remove all but one of the
conditions.
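
(If they were correlated, extended statistics would be the usual remedy -- a
sketch only, since the real names are obfuscated in the plan:

    create statistics five_whiskey_stats (dependencies, ndistinct)
        on five_uniform, whiskey_mike from the_scanned_table;
    analyze the_scanned_table;

though note that these statistics help clauses comparing columns to constants
rather than join clauses, and as said above the FK should already make the
extra join conditions redundant.)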

Maybe you saw this other thread, which I tentatively think also affects your
case (equijoin with nonjoin condition)
https://www.postgresql.org/message-id/AM6PR02MB5287A0ADD936C1FA80973E72AB190%40AM6PR02MB5287.eurprd02.prod.outlook.com