Re: Execution plan does not use index

2020-11-10 Thread Peter Coppens
Pavel

Tx for the tip. But given that if the I64_01 column is removed from the select 
list, the index is used I guess the cast is not likely to be the cause.

Like so

explain 
  select d.short_id,mv.timestamp --,mv.I64_01 
  from device d, device_configuration dc, measurement_value mv, 
pg_timezone_names tzn 
  where mv.device_id=d.short_id and dc.device_id = d.id and dc.latest=true and 
dc.timezone=tzn.name and 
mv.timestamp > '2020-11-06'::timestamp - tzn.utc_offset and 
mv.timestamp < '2020-11-07'::timestamp - tzn.utc_offset


==>

Nested Loop  (cost=1.13..6217004.08 rows=60425437 width=12)
  ->  Nested Loop  (cost=0.56..21334.84 rows=2186 width=20)
Join Filter: (dc.timezone = pg_timezone_names.name)
->  Nested Loop  (cost=0.56..7497.34 rows=615 width=18)
  ->  Index Scan using device_short_id_key on device d  
(cost=0.28..2423.90 rows=683 width=20)
  ->  Index Scan using device_configuration_device_latest_idx on 
device_configuration dc  (cost=0.28..7.42 rows=1 width=30)
Index Cond: ((device_id = d.id) AND (latest = true))
Filter: latest
->  Function Scan on pg_timezone_names  (cost=0.00..10.00 rows=1000 
width=48)
  ->  Index Only Scan using measurement_values_pkey on measurement_value mv  
(cost=0.57..2399.33 rows=43492 width=12)
Index Cond: ((device_id = d.short_id) AND ("timestamp" > ('2020-11-06 
00:00:00'::timestamp without time zone - pg_timezone_names.utc_offset)) AND 
("timestamp" < ('2020-11-07 00:00:00'::timestamp without time zone - 
pg_timezone_names.utc_offset)))


Peter

> On 10 Nov 2020, at 08:25, Pavel Stehule  wrote:
> 
> 
> 
> út 10. 11. 2020 v 8:18 odesílatel Peter Coppens  > napsal:
> Michael
> 
> Many thanks for spending your time on this. Your alternative does not help 
> unfortunately (see execution plan)
> 
> Still a sequential scan on the complete table. I have tried many alternatives 
> and somehow whenever I add a column that is not in the index (I64_01) the 
> optimizer decides not to use the index. If I remove that column, the index is 
> used. I guess it estimates that the extra indirection from index pages to the 
> row pages is more costly than scanning the 168M records. Pretty sure it’s 
> not, but I cannot explain it to the stubborn thing :)
> 
> Btw, thanks for the >= tip (I was aware of it)
> 
> Wkr,
> 
> Peter
> 
> 
> Hash Join  (cost=683.93..7270857.46 rows=458127 width=20)
>   Hash Cond: (mv_inner.device_id = d.short_id)
>   Join Filter: ((mv_inner."timestamp" > ('2020-11-06 00:00:00'::timestamp 
> without time zone - pg_timezone_names.utc_offset)) AND (mv_inner."timestamp" 
> < ('2020-11-07 00:00:00'::timestamp without time zone - 
> pg_timezone_names.utc_offset)))
>   ->  Seq Scan on measurement_value mv_inner  (cost=0.00..7166797.33 
> rows=1287989 width=1006)
> Filter: (("timestamp" > '2020-11-05 00:00:00'::timestamp without time 
> zone) AND ("timestamp" < '2020-11-08 00:00:00'::timestamp without time zone))
> 
> when you see cast in filter, then you should check type equality in 
> constraints. With some exception Postgres uses indexes only when filtered 
> value has same type like column type.
> 
> Maybe there is inconsistency between timestamp (with time zone), and 
> timestamp without time zone
> 
> Regards
> 
> Pavel
> 
> 
>   ->  Hash  (cost=656.61..656.61 rows=2186 width=20)
> ->  Hash Join  (cost=77.87..656.61 rows=2186 width=20)
>   Hash Cond: (dc.timezone = pg_timezone_names.name 
> )
>   ->  Hash Join  (cost=55.37..533.83 rows=615 width=18)
> Hash Cond: (dc.device_id = d.id )
> ->  Seq Scan on device_configuration dc  
> (cost=0.00..470.01 rows=615 width=30)
>   Filter: latest
> ->  Hash  (cost=46.83..46.83 rows=683 width=20)
>   ->  Seq Scan on device d  (cost=0.00..46.83 
> rows=683 width=20)
>   ->  Hash  (cost=10.00..10.00 rows=1000 width=48)
> ->  Function Scan on pg_timezone_names  (cost=0.00..10.00 
> rows=1000 width=48)
> 
> 
> 
>> On 10 Nov 2020, at 01:15, Michael Lewis > > wrote:
>> 
>> On Mon, Nov 9, 2020 at 1:11 PM Peter Coppens > > wrote:
>> Adding the tzn.utc_offset results in the fact that the execution plan no 
>> longer considers to use the index on the measurement_value table. Is there 
>> any way the SQL can be rewritten so that the index is used? Or any other 
>> solution so that the query with the timezone offset returns in a comparable 
>> time?
>> 
>> I am not aware of a best practice to handle this. Your where condition on 
>> mv.timestamp now depends on several joins to do a filtering that used to be 
>> a static range that can be scanned into the index as a first node in the 
>> plan. I have sometimes u

Re: Execution plan does not use index

2020-11-10 Thread Peter Coppens
Triggered by Michael mentioning subqueries I ended up trying

explain
  select d.short_id,mv.timestamp ,mv.I64_01
  from device d, device_configuration dc, measurement_value mv
  where mv.device_id=d.short_id and dc.device_id = d.id and dc.latest=true and 
mv.timestamp > '2020-11-06'::timestamp - (select tzn.utc_offset from 
pg_timezone_names tzn where tzn.name=dc.timezone) and mv.timestamp < 
'2020-11-07'::timestamp - (select tzn.utc_offset from pg_timezone_names tzn 
where tzn.name=dc.timezone)

==>
Nested Loop  (cost=25.85..84540074.64 rows=16996885 width=20)
  ->  Seq Scan on device_configuration dc  (cost=0.00..470.01 rows=615 width=30)
Filter: latest
  ->  Nested Loop  (cost=25.85..137027.83 rows=43494 width=36)
->  Index Scan using device_pkey on device d  (cost=0.28..7.23 rows=1 
width=20)
  Index Cond: (id = dc.device_id)
->  Index Scan using measurement_values_pkey on measurement_value mv  
(cost=25.58..136585.66 rows=43494 width=20)
  Index Cond: ((device_id = d.short_id) AND ("timestamp" > 
('2020-11-06 00:00:00'::timestamp without time zone - (SubPlan 1))) AND 
("timestamp" < ('2020-11-07 00:00:00'::timestamp without time zone - (SubPlan 
2
  SubPlan 1
->  Function Scan on pg_timezone_names  (cost=0.00..12.50 
rows=5 width=16)
  Filter: (name = dc.timezone)
  SubPlan 2
->  Function Scan on pg_timezone_names pg_timezone_names_1  
(cost=0.00..12.50 rows=5 width=16)
  Filter: (name = dc.timezone)


Now returns the 320K in less than 5sec. 

I was till now convinced that correlated subqueries or joins are equivalent. I 
guess I was wrong :). Wonder how stable this plan will be though

Peter

> On 10 Nov 2020, at 09:06, Peter Coppens  wrote:
> 
> Pavel
> 
> Tx for the tip. But given that if the I64_01 column is removed from the 
> select list, the index is used I guess the cast is not likely to be the cause.
> 
> Like so
> 
> explain 
>   select d.short_id,mv.timestamp --,mv.I64_01 
>   from device d, device_configuration dc, measurement_value mv, 
> pg_timezone_names tzn 
>   where mv.device_id=d.short_id and dc.device_id = d.id  and 
> dc.latest=true and dc.timezone=tzn.name and 
> mv.timestamp > '2020-11-06'::timestamp - tzn.utc_offset and 
> mv.timestamp < '2020-11-07'::timestamp - tzn.utc_offset
> 
> 
> ==>
> 
> Nested Loop  (cost=1.13..6217004.08 rows=60425437 width=12)
>   ->  Nested Loop  (cost=0.56..21334.84 rows=2186 width=20)
> Join Filter: (dc.timezone = pg_timezone_names.name)
> ->  Nested Loop  (cost=0.56..7497.34 rows=615 width=18)
>   ->  Index Scan using device_short_id_key on device d  
> (cost=0.28..2423.90 rows=683 width=20)
>   ->  Index Scan using device_configuration_device_latest_idx on 
> device_configuration dc  (cost=0.28..7.42 rows=1 width=30)
> Index Cond: ((device_id = d.id ) AND 
> (latest = true))
> Filter: latest
> ->  Function Scan on pg_timezone_names  (cost=0.00..10.00 rows=1000 
> width=48)
>   ->  Index Only Scan using measurement_values_pkey on measurement_value mv  
> (cost=0.57..2399.33 rows=43492 width=12)
> Index Cond: ((device_id = d.short_id) AND ("timestamp" > ('2020-11-06 
> 00:00:00'::timestamp without time zone - pg_timezone_names.utc_offset)) AND 
> ("timestamp" < ('2020-11-07 00:00:00'::timestamp without time zone - 
> pg_timezone_names.utc_offset)))
> 
> 
> Peter
> 
>> On 10 Nov 2020, at 08:25, Pavel Stehule > > wrote:
>> 
>> 
>> 
>> út 10. 11. 2020 v 8:18 odesílatel Peter Coppens > > napsal:
>> Michael
>> 
>> Many thanks for spending your time on this. Your alternative does not help 
>> unfortunately (see execution plan)
>> 
>> Still a sequential scan on the complete table. I have tried many 
>> alternatives and somehow whenever I add a column that is not in the index 
>> (I64_01) the optimizer decides not to use the index. If I remove that 
>> column, the index is used. I guess it estimates that the extra indirection 
>> from index pages to the row pages is more costly than scanning the 168M 
>> records. Pretty sure it’s not, but I cannot explain it to the stubborn thing 
>> :)
>> 
>> Btw, thanks for the >= tip (I was aware of it)
>> 
>> Wkr,
>> 
>> Peter
>> 
>> 
>> Hash Join  (cost=683.93..7270857.46 rows=458127 width=20)
>>   Hash Cond: (mv_inner.device_id = d.short_id)
>>   Join Filter: ((mv_inner."timestamp" > ('2020-11-06 00:00:00'::timestamp 
>> without time zone - pg_timezone_names.utc_offset)) AND (mv_inner."timestamp" 
>> < ('2020-11-07 00:00:00'::timestamp without time zone - 
>> pg_timezone_names.utc_offset)))
>>   ->  Seq Scan on measurement_value mv_inner  (cost=0.00..7166797.33 
>> rows=1287989 width=1006)
>> Filter: (("timestamp" > '2020-11-05 00:00:00'::timestamp without 
>> 

Re: Different result behavior when using ANY(ARRAY(SELECT)) and IN (SELECT)

2020-11-10 Thread Davide Jensen
Thanks for the suggestion,
I've tried to edit the query and added an order by in the OVER statement
but i'm still obtaining two results, but now of course those results are
sequential, the edited beginning of the query is:

 SELECT *
FROM   (
  SELECT _id,
 ROW_NUMBER() OVER (ORDER BY _id ASC) AS _rownumber
...

The results of 3 executions:
* 1) 64297639;22326
64297639;22327
* 2) 64297639;22942
64297639;22943
* 3) 64297639;22072
64297639;22073

Regarding the fictional row number, it's needed for some internal handling
of the results, the query is generated by a java application.
Thanks,
Davide Jensen

Il giorno lun 9 nov 2020 alle ore 16:10 Tom Lane  ha
scritto:

> Davide Jensen  writes:
> > I'm encountering some problems in understanding the behavior of a query
> > that uses an IN operator, the query i'm executing is the following:
>
> > SELECT * FROM   (
> >   SELECT _id,
> >  ROW_NUMBER() OVER () AS _rownumber
>
> I think your problem is in the above, not anything later in the query.
> ROW_NUMBER without any ordering specification is indeterminate.
> If the query gets parallelized, it's no surprise that the selected rows
> will be processed in varying order from one try to the next.  Your
> second phrasing of the query seems to be non-parallelizable, but the
> row_number() result is still pretty indeterminate; it just doesn't
> happen to have changed within your test run.
>
> What is it you are expecting to get out of including a fictional
> row number in the query result, anyway?
>
> regards, tom lane
>


psycopg3 and adaptation choices

2020-11-10 Thread Daniele Varrazzo
Hello,

I am developing psycopg3 , a major
release of the psycopg2 PostgreSQL driver for Python. This release
finally uses the advanced query protocol and server-side parameters
binding, so adaptation of Python values to PostgreSQL data types has,
at times, a different behaviour compared to the past. 100%
compatibility is not a target: that's the whole point of creating a
new package, but I am studying what is the sanest behaviour adaptation
can have.

I am exploring the best trade-offs in the choice of the mapping across
types: I have a write-up at
 and
would like to get input from PostgreSQL people regards whether the
choices we are making are the best possible ones.

Looking forward to get your feedback, thank you very much!

-- Daniele




Re: Execution plan does not use index

2020-11-10 Thread Michael Lewis
On Tue, Nov 10, 2020 at 1:25 AM Peter Coppens 
wrote:

> Triggered by Michael mentioning subqueries I ended up trying
>
> explain
>   select d.short_id,mv.timestamp ,mv.I64_01
>   from device d, device_configuration dc, measurement_value mv
>   where mv.device_id=d.short_id and dc.device_id = d.id and
> dc.latest=true and
> mv.timestamp > '2020-11-06'::timestamp - (select tzn.utc_offset
> from pg_timezone_names tzn where tzn.name=dc.timezone) and mv.timestamp <
> '2020-11-07'::timestamp - (select tzn.utc_offset from pg_timezone_names tzn
> where tzn.name=dc.timezone)
>
> ==>
> Nested Loop  (cost=25.85..84540074.64 rows=16996885 width=20)
>   ->  Seq Scan on device_configuration dc  (cost=0.00..470.01 rows=615
> width=30)
> Filter: latest
>   ->  Nested Loop  (cost=25.85..137027.83 rows=43494 width=36)
> ->  Index Scan using device_pkey on device d  (cost=0.28..7.23
> rows=1 width=20)
>   Index Cond: (id = dc.device_id)
> *->  Index Scan using measurement_values_pkey on measurement_value
> mv  (cost=25.58..136585.66 rows=43494 width=20)*
> *  Index Cond: ((device_id = d.short_id) AND ("timestamp" >
> ('2020-11-06 00:00:00'::timestamp without time zone - (SubPlan 1))) AND
> ("timestamp" < ('2020-11-07 00:00:00'::timestamp without time zone -
> (SubPlan 2*
>   SubPlan 1
> ->  Function Scan on pg_timezone_names  (cost=0.00..12.50
> rows=5 width=16)
>   Filter: (name = dc.timezone)
>   SubPlan 2
> ->  Function Scan on pg_timezone_names pg_timezone_names_1
>  (cost=0.00..12.50 rows=5 width=16)
>   Filter: (name = dc.timezone)
>
>
> Now returns the 320K in less than 5sec.
>
> I was till now convinced that correlated subqueries or joins are
> equivalent. I guess I was wrong :). Wonder how stable this plan will be
> though
>

Curious, what is seq_page_cost and random_page_cost? Any idea of your cache
hits for indexes? If they are very high and/or you have SSD or similar fast
storage, then maybe random_page_cost should be 1.1-2 and not default 4
(assuming seq_page_cost is still 1). The planner will be more likely to use
an index scan if the expected cost for scanning an index (random) is closer
to a sequential read. Sorry if this explanation is completely superfluous
and you have already configured this.

It would be interesting to see explain (analyze buffers) output so we can
see the actual counts for these nodes. I'm rather surprised that the query
I provided didn't use the timestamp index unless the inclusion of 3 days
worth of range meant that it was estimated to be too high a fraction of the
table. If you just execute only the subquery, is the index used by chance?

Top posting (reply with all previous copied below) is discouraged on these
lists. I think because it makes the archives rather messy.


Re: Execution plan does not use index

2020-11-10 Thread Peter Coppens
> 
> Curious, what is seq_page_cost and random_page_cost?

show seq_page_cost
->1

show random_page_cost
->4


> Any idea of your cache hits for indexes?

No, I am afraid not. It’s been a long time since I went that deep in the RDBMS 
behaviour (must have been Oracle 7.something  :) )


> If they are very high and/or you have SSD or similar fast storage, then maybe 
> random_page_cost should be 1.1-2 and not default 4 (assuming seq_page_cost is 
> still 1). The planner will be more likely to use an index scan if the 
> expected cost for scanning an index (random) is closer to a sequential read. 
> Sorry if this explanation is completely superfluous and you have already 
> configured this.

I played around with some of the settings (also with setting enable_seqscan to 
false). While some of the plans did no longer use a full table scan, the 
performance did not really work out when I tried. Tbh, I never got too deep in 
those plans to find out why that would be, as I am not too keen on changing 
these settings in the first place.

> 
> It would be interesting to see explain (analyze buffers) output so we can see 
> the actual counts for these nodes. I'm rather surprised that the query I 
> provided didn't use the timestamp index unless the inclusion of 3 days worth 
> of range meant that it was estimated to be too high a fraction of the table. 
> If you just execute only the subquery, is the index used by chance?
> 
Index is not used for the subquery

explain 
select mv_inner.*
  from measurement_value AS mv_inner
  where mv_inner.timestamp > '2020-11-06'::timestamp - interval '1 day' and 
mv_inner.timestamp < '2020-11-07'::timestamp + interval '1 day'
offset 0

==>

Seq Scan on measurement_value mv_inner (cost=0.00..7175777.00 rows=1219616 
width=1006)
  Filter: (("timestamp" > '2020-11-05 00:00:00'::timestamp without time zone) 
AND ("timestamp" < '2020-11-08 00:00:00'::timestamp without time zone))



> Top posting (reply with all previous copied below) is discouraged on these 
> lists. I think because it makes the archives rather messy. 

Will try to remember

Re: Execution plan does not use index

2020-11-10 Thread Michael Lewis
On Tue, Nov 10, 2020, 3:24 PM Peter Coppens 
wrote:

> Index is not used for the subquery
>
> explain
> select mv_inner.*
>   from measurement_value AS mv_inner
>   where mv_inner.timestamp > '2020-11-06'::timestamp - interval '1 day'
> and mv_inner.timestamp < '2020-11-07'::timestamp + interval '1 day'
> offset 0
>
> ==>
>
> Seq Scan on measurement_value mv_inner (cost=0.00..7175777.00 rows=1219616
> width=1006)
>   Filter: (("timestamp" > '2020-11-05 00:00:00'::timestamp without time
> zone) AND ("timestamp" < '2020-11-08 00:00:00'::timestamp without time
> zone))
>


Curious, how accurate is that row count of 1.2 million records for 3 days?
How many total records in the table? If you disable sequential scan, does
it choose the index and what cost does it show?


Re: Execution plan does not use index

2020-11-10 Thread Peter Coppens
> Curious, how accurate is that row count of 1.2 million records for 3 days?


Not to bad actually

select count(mv_inner.*)
from measurement_value AS mv_inner
where mv_inner.timestamp > '2020-11-06'::timestamp - interval '1 day' and 
mv_inner.timestamp < '2020-11-07'::timestamp + interval '1 day'
offset 0

==> 1128736


> How many total records in the table?

±168 million

> If you disable sequential scan, does it choose the index and what cost does 
> it show?

It chooses the index, but apparently to create some intermediate structure that 
then later still needs to be joined on the device_id. Probably requires 
scanning all pages of the index, which might explain why the performance is 
still not ok

set enable_seqscan to false;
explain
select d.short_id,mv.timestamp,mv.I64_01
  from 
device d
, device_configuration dc
, (
  select mv_inner.*
from measurement_value AS mv_inner
where mv_inner.timestamp > '2020-11-06'::timestamp - interval '1 day' 
and mv_inner.timestamp < '2020-11-07'::timestamp + interval '1 day'
offset 0
  ) mv
, pg_timezone_names tzn
  where mv.device_id=d.short_id and dc.device_id = d.id and dc.latest=true and 
dc.timezone=tzn.name and
mv.timestamp > '2020-11-06'::timestamp - tzn.utc_offset and 
mv.timestamp < '2020-11-07'::timestamp - tzn.utc_offset

==>


Hash Join  (cost=6677594.18..9545649.57 rows=434126 width=20)
  Hash Cond: (mv_inner.device_id = d.short_id)
  Join Filter: ((mv_inner."timestamp" > ('2020-11-06 00:00:00'::timestamp 
without time zone - pg_timezone_names.utc_offset)) AND (mv_inner."timestamp" < 
('2020-11-07 00:00:00'::timestamp without time zone - 
pg_timezone_names.utc_offset)))
  ->  Bitmap Heap Scan on measurement_value mv_inner  
(cost=6676540.29..9446603.90 rows=1220458 width=1006)
Recheck Cond: (("timestamp" > '2020-11-05 00:00:00'::timestamp without 
time zone) AND ("timestamp" < '2020-11-08 00:00:00'::timestamp without time 
zone))
->  Bitmap Index Scan on measurement_values_pkey  
(cost=0.00..6676235.18 rows=1220458 width=0)
  Index Cond: (("timestamp" > '2020-11-05 00:00:00'::timestamp 
without time zone) AND ("timestamp" < '2020-11-08 00:00:00'::timestamp without 
time zone))
  ->  Hash  (cost=1026.55..1026.55 rows=2187 width=20)
->  Hash Join  (cost=471.95..1026.55 rows=2187 width=20)
  Hash Cond: (dc.timezone = pg_timezone_names.name)
  ->  Hash Join  (cost=449.45..903.76 rows=615 width=18)
Hash Cond: (dc.device_id = d.id)
->  Bitmap Heap Scan on device_configuration dc  
(cost=242.72..688.58 rows=615 width=30)
  Filter: latest
  ->  Bitmap Index Scan on 
device_configuration_device_latest_idx  (cost=0.00..242.57 rows=615 width=0)
Index Cond: (latest = true)
->  Hash  (cost=198.19..198.19 rows=683 width=20)
  ->  Index Scan using device_short_id_key on device d  
(cost=0.28..198.19 rows=683 width=20)
  ->  Hash  (cost=10.00..10.00 rows=1000 width=48)
->  Function Scan on pg_timezone_names  (cost=0.00..10.00 
rows=1000 width=48)