Finding out why parallel queries not avoided

2018-07-21 Thread Didier Carlier

I’m trying to find out why parallel queries are sometimes not used.

For example, I have 2 tables, calendar (1 row per day, ~3K rows) and measure 
(~300M rows) which includes a FK to calendar.

I.e., knowing two day numbers, I can find out how many measures there are between 
these two dates with:
select count(*) from measure m where m.fromdateid >= 1462 and m.fromdateid < 1826;
(1462 and 1826 are the calendar ids corresponding to 2015-01-01 and 2015-12-31)

This uses parallel query:
explain select count(*) from measure m where m.fromdateid >= 1462 and m.fromdateid < 1826;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Finalize Aggregate  (cost=3894860.64..3894860.65 rows=1 width=8)
   ->  Gather  (cost=3894860.61..3894860.62 rows=8 width=8)
         Workers Planned: 8
         ->  Partial Aggregate  (cost=3894860.61..3894860.62 rows=1 width=8)
               ->  Parallel Bitmap Heap Scan on measure m  (cost=11265.96..3881068.52 rows=5516835 width=0)
                     Recheck Cond: ((fromdateid >= 1462) AND (fromdateid < 1826))
                     ->  Bitmap Index Scan on idx_measure_fromdate  (cost=0.00..232.29 rows=44134699 width=0)
                           Index Cond: ((fromdateid >= 1462) AND (fromdateid < 1826))


The “equivalent” query without hard coding the day numbers gives this query 
plan:

explain select count(*) from calendar c1, calendar c2, measure m where
 c1.stddate='2015-01-01' and c2.stddate='2015-12-31' and
 m.fromdateid >= c1.calendarid and m.fromdateid < c2.calendarid;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=5073362.73..5073362.74 rows=1 width=8)
   ->  Nested Loop  (cost=8718.47..4988195.81 rows=34066770 width=0)
         ->  Index Scan using calendar_stddate_unique on calendar c2  (cost=0.28..2.30 rows=1 width=4)
               Index Cond: (stddate = '2015-12-31 00:00:00+01'::timestamp with time zone)
         ->  Nested Loop  (cost=8718.19..4647525.81 rows=34066770 width=4)
               ->  Index Scan using calendar_stddate_unique on calendar c1  (cost=0.28..2.30 rows=1 width=4)
                     Index Cond: (stddate = '2015-01-01 00:00:00+01'::timestamp with time zone)
               ->  Bitmap Heap Scan on measure m  (cost=8717.91..4306855.81 rows=34066770 width=4)
                     Recheck Cond: ((fromdateid >= c1.calendarid) AND (fromdateid < c2.calendarid))
                     ->  Bitmap Index Scan on idx_measure_fromdate  (cost=0.00..201.22 rows=34072527 width=0)
                           Index Cond: ((fromdateid >= c1.calendarid) AND (fromdateid < c2.calendarid))

Both queries return the same answers but I don't see why the second one doesn't 
use parallel query.
I've tried a few different ways to express the same thing, e.g. subselect, CTE, 
etc., to make the query planner's work easier, but it always avoids the 
parallel query.
I also set parallel_tuple_cost and parallel_setup_cost to 0, without success.

Any ideas? Or is there a way to ask the query planner for more details about the 
decisions it makes?

Kind regards,
Didier
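
On the question of getting more detail out of the planner: EXPLAIN shows only the plan that was chosen, not the alternatives that were rejected, but a few session settings can help narrow things down. What follows is a rough sketch for a 10.x server rather than a definitive recipe. It reuses the join query from the message above, and the comments on what each setting reveals are a reading of the documented behaviour, not something confirmed in this thread.

```sql
-- Limits on parallelism; 0 for max_parallel_workers_per_gather disables parallel plans.
SHOW max_parallel_workers_per_gather;
SHOW max_parallel_workers;
-- Relations below these size thresholds are not considered for parallel scans.
SHOW min_parallel_table_scan_size;
SHOW min_parallel_index_scan_size;

-- Diagnosis only, not for production: make parallelism look free to the planner ...
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
-- ... and put a Gather node on top of the plan whenever the query is parallel safe.
-- This does not parallelize the scan itself; it mainly shows whether the query
-- is being treated as parallel unsafe.
SET force_parallel_mode = on;

EXPLAIN (VERBOSE)
SELECT count(*)
FROM calendar c1, calendar c2, measure m
WHERE c1.stddate = '2015-01-01'
  AND c2.stddate = '2015-12-31'
  AND m.fromdateid >= c1.calendarid
  AND m.fromdateid < c2.calendarid;

-- Restore the session defaults afterwards.
RESET force_parallel_mode;
RESET parallel_setup_cost;
RESET parallel_tuple_cost;
```

If a Gather node appears only with force_parallel_mode on, the query is parallel safe and the planner simply found no cheaper parallel path for this plan shape. Note that PostgreSQL 16 and later call this setting debug_parallel_query.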

Re: Finding out why parallel queries not avoided

2018-07-22 Thread Didier Carlier




> On 22 Jul 2018, at 05:45, David Rowley wrote:
> 
> On 21 July 2018 at 20:15, Didier Carlier wrote:
>> explain select count(*) from calendar c1, calendar c2, measure m where
>> c1.stddate='2015-01-01' and c2.stddate='2015-12-31' and
>> m.fromdateid >= c1.calendarid and m.fromdateid < c2.calendarid;
>>                                    QUERY PLAN
>> --------------------------------------------------------------------------------
>>  Aggregate  (cost=5073362.73..5073362.74 rows=1 width=8)
>>    ->  Nested Loop  (cost=8718.47..4988195.81 rows=34066770 width=0)
>>          ->  Index Scan using calendar_stddate_unique on calendar c2  (cost=0.28..2.30 rows=1 width=4)
>>                Index Cond: (stddate = '2015-12-31 00:00:00+01'::timestamp with time zone)
>>          ->  Nested Loop  (cost=8718.19..4647525.81 rows=34066770 width=4)
>>                ->  Index Scan using calendar_stddate_unique on calendar c1  (cost=0.28..2.30 rows=1 width=4)
>>                      Index Cond: (stddate = '2015-01-01 00:00:00+01'::timestamp with time zone)
>>                ->  Bitmap Heap Scan on measure m  (cost=8717.91..4306855.81 rows=34066770 width=4)
>>                      Recheck Cond: ((fromdateid >= c1.calendarid) AND (fromdateid < c2.calendarid))
>>                      ->  Bitmap Index Scan on idx_measure_fromdate  (cost=0.00..201.22 rows=34072527 width=0)
>>                            Index Cond: ((fromdateid >= c1.calendarid) AND (fromdateid < c2.calendarid))
>> 
>> Both queries return the same answers but I don't see why the second one 
>> doesn't use parallel query.
> 
> You'd likely be better off writing the query as:
> 
> select count(*) from measure where fromdateid >= (select calendarid
> from calendar where stddate = '2015-01-01') and fromdateid < (select
> calendarid from calendar where stddate = '2015-12-31');
> 
> The reason you get the poor nested loop plan is that nested loop is
> the only join method that supports non-equijoin.

It doesn’t use a parallel query, but it is indeed faster (~12 sec vs 9 sec); 
thanks for the info.

> 
> Unsure why you didn't get a parallel plan. Parallel in pg10 supports a
> few more plan shapes than 9.6 did. Unsure what version you're using.

It’s on 10.3, which is the latest prebuilt package available for SmartOS.
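
Since the version with literal ids does get a parallel plan, one possible workaround is to look the ids up first and then run the count with the values substituted as constants. This is a minimal sketch, assuming psql is the client. The variable names from_id and to_id are made up for illustration, and it relies on stddate being unique (as the calendar_stddate_unique index suggests) so that the lookup returns exactly one row.

```sql
-- Step 1: fetch the two calendar ids and store them in psql variables.
-- The gset meta-command assigns each output column to a variable named after it.
SELECT c1.calendarid AS from_id, c2.calendarid AS to_id
FROM calendar c1, calendar c2
WHERE c1.stddate = '2015-01-01'
  AND c2.stddate = '2015-12-31'
\gset

-- Step 2: psql interpolates the values before sending the query, so the planner
-- sees plain constants, as in the first query of this thread.
EXPLAIN
SELECT count(*)
FROM measure m
WHERE m.fromdateid >= :from_id
  AND m.fromdateid < :to_id;
```

The second statement should be planned like the hard-coded query, though whether it actually gets a parallel plan on a given system still depends on the settings and statistics there.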
