Query takes way longer with LIMIT, and EXPLAIN takes way longer than actual query

2020-08-14 Thread Ken Tanzer
Hi.  I've got a query that runs fine (~50ms).  When I add a "LIMIT 25" to
it though, it takes way longer.  The query itself then takes about 4.5
seconds.  And when I do an explain, it takes 90+ seconds for the same query!

Explains and detailed table/view info below.  tbl_log has 1.2M records,
tbl_reference has 550k.  This is PostgreSQL 9.6.19 on CentOS 6 with PGDG packages.

I know the query itself could be re-written, but it's coming from an ORM,
so I'm really focused on why adding a limit causes such a performance
degradation, and what to do about it.  I'd also like to understand the
discrepancy between running the actual query and running the EXPLAIN.  Any
help or insight would be appreciated.  Thanks!

Ken


*The good query (no LIMIT):*

agency=> EXPLAIN (ANALYZE,VERBOSE,BUFFERS,TIMING)  SELECT * FROM Log WHERE
log_id IN (SELECT CASE WHEN to_table='client' THEN from_id WHEN
from_table='client' THEN to_id END FROM reference WHERE ((from_id_field =
 E'client_id'
AND from_id =  E'34918'
AND from_table =  E'client'
AND to_table =  E'log'
)
OR  (to_id_field =  E'client_id'
AND to_id =  E'34918'
AND to_table =  E'client'
AND from_table =  E'log'
))) ORDER BY added_at DESC;


                                  QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=167065.81..168594.77 rows=611586 width=336) (actual
time=43.942..46.566 rows=1432 loops=1)
   Output: tbl_log.log_id, tbl_log.log_type_code, tbl_log.subject,
tbl_log.log_text, tbl_log.occurred_at, tbl_log.shift_report,
tbl_log.written_by, tbl_log.added_by, tbl_log.added_at, tbl_log.changed_by,
tbl_log.changed_at, tbl_log.sys_log
   Sort Key: tbl_log.added_at DESC
   Sort Method: quicksort  Memory: 999kB
   Buffers: shared hit=7026
   ->  Nested Loop  (cost=4313.36..14216.18 rows=611586 width=336) (actual
time=10.837..38.177 rows=1432 loops=1)
 Output: tbl_log.log_id, tbl_log.log_type_code, tbl_log.subject,
tbl_log.log_text, tbl_log.occurred_at, tbl_log.shift_report,
tbl_log.written_by, tbl_log.added_by, tbl_log.added_at, tbl_log.changed_by,
tbl_log.changed_at, tbl_log.sys_log
 Buffers: shared hit=7026
 ->  HashAggregate  (cost=4312.93..4325.68 rows=1275 width=136)
(actual time=10.802..13.800 rows=1433 loops=1)
   Output: tbl_reference.to_table, tbl_reference.from_id,
tbl_reference.from_table, tbl_reference.to_id
   Group Key: CASE WHEN (tbl_reference.to_table =
'client'::name) THEN tbl_reference.from_id WHEN (tbl_reference.from_table =
'client'::name) THEN tbl_reference.to_id ELSE NULL::integer END
   Buffers: shared hit=1288
   ->  Bitmap Heap Scan on public.tbl_reference
 (cost=46.69..4309.74 rows=1276 width=136) (actual time=0.747..6.822
rows=1433 loops=1)
 Output: tbl_reference.to_table, tbl_reference.from_id,
tbl_reference.from_table, tbl_reference.to_id, CASE WHEN
(tbl_reference.to_table = 'client'::name) THEN tbl_reference.from_id WHEN
(tbl_reference.from_table = 'client'::name) THEN tbl_reference.to_id ELSE
NULL::integer END
 Recheck Cond: ((tbl_reference.from_id_field =
'client_id'::name) OR ((tbl_reference.to_table = 'client'::name) AND
(tbl_reference.to_id = 34918)))
 Filter: ((NOT tbl_reference.is_deleted) AND
(((tbl_reference.from_id_field = 'client_id'::name) AND
(tbl_reference.from_id = 34918) AND (tbl_reference.from_table =
'client'::name) AND (tbl_reference.to_table = 'log'::name)) OR
((tbl_reference.to_id_field = 'client_id'::name) AND (tbl_reference.to_id =
34918) AND (tbl_reference.to_table = 'client'::name) AND
(tbl_reference.from_table = 'log'::name))))
 Rows Removed by Filter: 15
 Heap Blocks: exact=1275
 Buffers: shared hit=1288
 ->  BitmapOr  (cost=46.69..46.69 rows=1319 width=0)
(actual time=0.453..0.454 rows=0 loops=1)
   Buffers: shared hit=13
   ->  Bitmap Index Scan on
index_tbl_reference_from_id_field  (cost=0.00..4.43 rows=1 width=0) (actual
time=0.025..0.026 rows=0 loops=1)
 Index Cond: (tbl_reference.from_id_field =
'client_id'::name)
 Buffers: shared hit=3
   ->  Bitmap Index Scan on
index_tbl_reference_to_table_id  (cost=0.00..41.61 rows=1319 width=0)
(actual time=0.421..0.423 rows=1448 loops=1)
 Index Cond: ((tbl_reference.to_table =
'client'::name) AND (tbl_reference.to_id = 34918))
 

Re: Query takes way longer with LIMIT, and EXPLAIN takes way longer than actual query

2020-08-14 Thread Justin Pryzby
On Fri, Aug 14, 2020 at 02:34:52PM -0700, Ken Tanzer wrote:
> Hi.  I've got a query that runs fine (~50ms).  When I add a "LIMIT 25" to
> it though, it takes way longer.  The query itself then takes about 4.5
> seconds.  And when I do an explain, it takes 90+ seconds for the same query!

Due to the over-estimated rowcount, the planner believes that (more) rows will
be output (sooner) than they actually are:

   ->  Nested Loop Semi Join  (cost=47.11..31367302.81 ROWS=611586 width=336) 
(actual time=47.098..97236.123 ROWS=25 loops=1)

So it thinks there's something to be saved/gained by using a plan that has a
low startup cost.  But instead, it ends up running for a substantial fraction
of the total (estimated) cost.

As for the "explain is more expensive than the query", that could be due to
timing overhead, as mentioned here.  Test with "explain (timing off)" ?
https://www.postgresql.org/docs/12/using-explain.html#USING-EXPLAIN-CAVEATS
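
For example (a sketch, untested: it is simply the query from the original post,
with the LIMIT 25 that causes the slowdown, re-run with TIMING OFF):

EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING OFF)
SELECT * FROM log
WHERE log_id IN (
    SELECT CASE WHEN to_table = 'client' THEN from_id
                WHEN from_table = 'client' THEN to_id END
    FROM reference
    WHERE (from_id_field = 'client_id' AND from_id = '34918'
           AND from_table = 'client' AND to_table = 'log')
       OR (to_id_field = 'client_id' AND to_id = '34918'
           AND to_table = 'client' AND from_table = 'log'))
ORDER BY added_at DESC
LIMIT 25;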

-- 
Justin




Re: Query takes way longer with LIMIT, and EXPLAIN takes way longer than actual query

2020-08-14 Thread Ken Tanzer
On Fri, Aug 14, 2020 at 3:04 PM Justin Pryzby  wrote:

> Due to the over-estimated rowcount, the planner believes that (more) rows
> will
> be output (sooner) than they actually are:
>
>->  Nested Loop Semi Join  (cost=47.11..31367302.81 ROWS=611586
> width=336) (actual time=47.098..97236.123 ROWS=25 loops=1)
>
> So it thinks there's something to be saved/gained by using a plan that has
> a
> low startup cost.  But instead, it ends up running for a substantial
> fraction
> of the total (estimated) cost.
>
Got it.  Is there any way to address this other than re-writing the
query?  (Statistics? Or something else?)



> As for the "explain is more expensive than the query", that could be due to
> timing overhead, as mentioned here.  Test with "explain (timing off)" ?
> https://www.postgresql.org/docs/12/using-explain.html#USING-EXPLAIN-CAVEATS
>
>
Good call--explain with the timing off showed about the same time as the
actual query.

Thanks!

Ken




> --
> Justin
>


-- 
AGENCY Software
A Free Software data system
By and for non-profits
http://agency-software.org/
https://demo.agency-software.org/client
[email protected]
(253) 245-3801

Subscribe to the mailing list to learn more about AGENCY or
follow the discussion.


Re: Query takes way longer with LIMIT, and EXPLAIN takes way longer than actual query

2020-08-14 Thread Jeff Janes
On Fri, Aug 14, 2020 at 5:35 PM Ken Tanzer  wrote:

> Hi.  I've got a query that runs fine (~50ms).  When I add a "LIMIT 25" to
> it though, it takes way longer.  The query itself then takes about 4.5
> seconds.  And when I do an explain, it takes 90+ seconds for the same query!
>
> Explains and detailed table/view info below.  tbl_log has 1.2M records,
> tbl_reference has 550k.  This is 9.6.19 on CentOS 6 with PDGG packages.
>

CentOS 6 has slow clock calls, so it is not surprising that EXPLAIN ANALYZE
with TIMING defaulting to ON is slow.  Using something more modern for the
distribution should really help that, but for the current case just setting
TIMING OFF should be good enough as it is the row counts which are
interesting, not the timing of individual steps.


> I know the query itself could be re-written, but it's coming from an ORM,
> so I'm really focused on why the adding a limit is causing such performance
> degradation, and what to do about it.
>

But if it is coming from an ORM and you can't rewrite it, then what can you
do about it?  Can you set enable_something or something_cost parameters
locally just for the duration of one query?  If the ORM doesn't let you
re-write, then I doubt it would let you do that, either.  Since you are
using such an old version, you can't create multivariate statistics, either
(although I doubt they would help anyway).
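
If the ORM does let you send raw statements before and after the query, the
mechanism would be a transaction-scoped setting, roughly like this (a sketch
only; enable_indexscan is just an illustration, not a recommendation for this
particular plan):

BEGIN;
SET LOCAL enable_indexscan = off;  -- reverts automatically at COMMIT/ROLLBACK
-- ... the ORM-generated SELECT ... ORDER BY added_at DESC LIMIT 25 goes here ...
COMMIT;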

>    ->  Nested Loop  (cost=4313.36..14216.18 rows=611586 width=336) (actual
> time=10.837..38.177 rows=1432 loops=1)
>  ->  HashAggregate  (cost=4312.93..4325.68 rows=1275 width=136)
> (actual time=10.802..13.800 rows=1433 loops=1)
>  ->  Index Scan using tbl_log_pkey on public.tbl_log
>  (cost=0.43..7.75 rows=1 width=336) (actual time=0.007..0.009 rows=1
> loops=1433)
>

The way-off row estimate for the nested loop is the cause of the bad plan
choice once you add the LIMIT.  But what is the cause of the bad estimate?
If you just multiply the estimates for each of the child nodes, you get
about the correct answer.  But the estimate for the nested loop is very
different from the product of the children.  On the one hand that isn't
surprising, as the row estimates are computed at each node from first
principles, not computed from the bottom up.  But usually if the stats are
way off, you can follow the error down to a lower level where they are also
way off, but in this case you can't.  That makes it really hard to reason
about what the problem might be.

Can you clone your server, upgrade the clone to 12.4 or 13BETA3 or 14dev,
and see if the problem still exists there?  Can you anonymize your data so
that you can publish an example other people could run themselves to
dissect the problem; or maybe give some queries that generate random data
which have the correct data distribution to reproduce the issue?

Cheers,

Jeff

>


Re: Query takes way longer with LIMIT, and EXPLAIN takes way longer than actual query

2020-08-14 Thread Justin Pryzby
On Fri, Aug 14, 2020 at 03:40:40PM -0700, Ken Tanzer wrote:
> On Fri, Aug 14, 2020 at 3:04 PM Justin Pryzby  wrote:
> > Due to the over-estimated rowcount, the planner believes that (more) rows
> > will be output (sooner) than they actually are:
> >
> >->  Nested Loop Semi Join  (cost=47.11..31367302.81 ROWS=611586
> > width=336) (actual time=47.098..97236.123 ROWS=25 loops=1)
> >
> > So it thinks there's something to be saved/gained by using a plan that has a
> > low startup cost.  But instead, it ends up running for a substantial 
> > fraction
> > of the total (estimated) cost.
>
> Got it.  Is there any way to address this other than re-writing the
> query?  (Statistics? Or something else?)

A usual trick is to write something like:
|ORDER BY added_at + '0 seconds'::interval
which means an index scan on added_at doesn't match the ORDER BY exactly (the
planner isn't smart enough to know better).
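
Applied to this query, only the tail of the statement changes, along these
lines (a fragment; the SELECT list and the IN (...) subquery stay exactly as
they were):

  ...
  ORDER BY added_at + '0 seconds'::interval DESC
  LIMIT 25;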

You could try to address the misestimate, which is probably a good idea anyway.

Or make it believe that it's going to work harder to return those 25 rows.
Maybe you could change the "25" to a bind parameter, like LIMIT $N, or however
the ORM wants it.

You could change the added_at index to a BRIN index, which can't be used to
satisfy the ORDER BY, so this plan wouldn't be tempting anymore (but that will
also affect other queries which want to ORDER BY added_at).
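
An illustration of that (the new index name is invented; dropping the btree
obviously affects anything else that was using it):

DROP INDEX tbl_log_added_at;
CREATE INDEX tbl_log_added_at_brin ON tbl_log USING brin (added_at);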

Maybe you could add an index on this expression and ANALYZE the table.  I think
it might help the estimate.  Or it might totally change the shape of the plan,
like allowing an index-only scan (which would probably require a VACUUM first):
   Group Key: CASE WHEN (tbl_reference.to_table = 'client'::name) THEN 
tbl_reference.from_id WHEN (tbl_reference.from_table = 'client'::name) THEN 
tbl_reference.to_id ELSE NULL::integer END
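
A sketch of that suggestion (the index name is invented; the expression is the
Group Key shown above):

CREATE INDEX tbl_reference_client_ref_expr ON tbl_reference
    ((CASE WHEN to_table = 'client' THEN from_id
           WHEN from_table = 'client' THEN to_id END));
ANALYZE tbl_reference;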

I know you didn't want to rewrite the query, but it looks to me like adjusting
the schema or query might be desirable.

agency=> EXPLAIN (ANALYZE,VERBOSE,BUFFERS,TIMING) SELECT * FROM Log WHERE 
log_id IN (SELECT CASE WHEN to_table='client' THEN from_id WHEN 
from_table='client' THEN to_id END FROM reference WHERE ((from_id_field = 
E'client_id'
AND from_id =  E'34918'
AND from_table =  E'client'
AND to_table =  E'log'
)
OR  (to_id_field =  E'client_id'
AND to_id =  E'34918'
AND to_table =  E'client'
AND from_table =  E'log'
))) ORDER BY added_at DESC;

To me that smells like a UNION (maybe not UNION ALL):
SELECT * FROM log WHERE EXISTS (SELECT 1 FROM reference ref WHERE
log.log_id=ref.to_id AND ref.from_id_field='client_id' AND ref.from_id=$1 AND
ref.from_table='client' AND ref.to_table='log')
UNION
SELECT * FROM log WHERE EXISTS (SELECT 1 FROM reference ref WHERE
log.log_id=ref.from_id AND ref.to_id_field='client_id' AND ref.to_id=$1 AND
ref.to_table='client' AND ref.from_table='log')

I guess you might know that various indexes are redundant:

"index_tbl_log_log_type_code" btree (log_type_code)
"tbl_log_log_type_code" btree (log_type_code)
"tbl_log_test2" btree (log_type_code, added_at)

"tbl_log_added_at" btree (added_at)
"tbl_log_test" btree (added_at, log_type_code)

"index_tbl_reference_to_table" btree (to_table)
"index_tbl_reference_to_table_id" btree (to_table, to_id)

"index_tbl_reference_is_deleted" btree (is_deleted)
=> Maybe that would be better as a WHERE NOT is_deleted clause on various 
indexes (?)
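
For example, something like this (the index name is invented; it folds the
is_deleted condition into the existing (to_table, to_id) index as a partial
index, after which index_tbl_reference_is_deleted could probably be dropped):

CREATE INDEX index_tbl_reference_to_table_id_live
    ON tbl_reference (to_table, to_id)
    WHERE NOT is_deleted;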

-- 
Justin