Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

2023-10-04 Thread Marian Wendt
set timezone to 'UTC';
select
    days as start_time,
    lead(days) over (order BY days) as end_time
from generate_series(
    '2023-11-03 00:00:00.00-0400'::timestamptz,
    '2023-11-07 00:00:00.00-0500'::timestamptz,
    '1 day'::interval
) days;

       start_time       |        end_time
+
 2023-11-03 04:00:00+00 | 2023-11-04 04:00:00+00
 2023-11-04 04:00:00+00 | 2023-11-05 04:00:00+00
 2023-11-05 04:00:00+00 | 2023-11-06 04:00:00+00
 2023-11-06 04:00:00+00 | 2023-11-07 04:00:00+00
[etc.]


set timezone to 'America/New_York';
[same as above]

       start_time       |        end_time
+
 2023-11-03 00:00:00-04 | 2023-11-04 00:00:00-04
 2023-11-04 00:00:00-04 | 2023-11-05 00:00:00-04
 2023-11-05 00:00:00-04 | 2023-11-06 00:00:00-05
 2023-11-06 00:00:00-05 | 2023-11-07 00:00:00-05
[etc.]
```

The latter being correct for these purposes, but not seeming super
reliable/practical (see: point 5).

Can anyone think of any other approaches to this problem? I'd be
thrilled if I could manage it without resorting to manually
post-processing in pandas or something, because I have a number of
datapoints, and each individual one is quite wide.

Apologies for the length of this message--just trying to be
thorough. I sincerely appreciate any help or pointers!

Best,
Lincoln


-- 
Lincoln Swaine-Moore




That's a long email to digest but a couple thoughts that may help.

I've always disliked the term "timestamp with time zone" and always 
mentally translate that to "point in time". As long as data is stored 
as a timestamp with time zone (point in time), it can be manipulated 
in any time zone you desire including handling DST.


Time calculation always involves some assumption of meaning and 
PostgreSQL makes some reasonable ones. I'm in US Pacific time and if I 
run:

select '2023-11-05'::timestamptz ;
I get:
 timestamptz

2023-11-05 00:00:00-07

If I add a day (crossing the DST boundary):
select '2023-11-05'::timestamptz + '1 day'::interval;
   ?column?

2023-11-06 00:00:00-08

Observe that I end up at midnight the following day. But if I instead 
add 24 hours:

select '2023-11-05'::timestamptz + '24 hours'::interval;
   ?column?

2023-11-05 23:00:00-08

24 hours is exactly what I get.

You are generating 15-minute intervals the hard way. You can do it 
directly and have DST handled for you:
select generate_series('2023-11-05'::timestamptz, 
'2023-11-06'::timestamptz, '15 minutes'::interval);

    generate_series

2023-11-05 00:00:00-07
2023-11-05 00:15:00-07
2023-11-05 00:30:00-07
2023-11-05 00:45:00-07
2023-11-05 01:00:00-07
2023-11-05 01:15:00-07
2023-11-05 01:30:00-07
2023-11-05 01:45:00-07
2023-11-05 01:00:00-08
2023-11-05 01:15:00-08
...
2023-11-06 00:00:00-08
(101 rows)

Note that 01:00:00-07 is a different point in time than 01:00:00-08. 
Sticking with timestamp with time zone, aka a fully qualified point in 
time, removes any ambiguity. Also observe that I get the correct 
number of "bins". If I run the same thing but for March 12 2023 
(spring forward)  I'll get 93 rows but on "normal days" there will be 97.


I suspect your best bet will be to store the data as type time 
stamp with time zone and to set the time zone before running your queries.


Note that setting the time zone is a client/connection setting so if 
you set it within a transaction, it will stay set when the transaction 
concludes. But time manipulation is tricky and trying to DIY reinvent 
the wheel is painful and often buggy. Let PostgreSQL do the work for you.


Cheers,
Steve




UTC is an excellent form of timestamps as a linear quantity in the db 
(like Steve wrote: "point in time"); no gaps, no duplications.


For the sake of simplicity, I only included the possible date-bin 
variants so that the results can be compared; ordering and grouping with 
just one date_bin etc. can be easily customized...


SET TIMEZONE='Etc/UTC';
SELECT
    sub.gs
    ,date_bin('15 minutes', sub.gs, '2023-01-01') AS norm15minutes
    ,date_bin('1 hours', sub.gs, '2023-01-01') AS norm1hour
    ,date_bin('1 days', sub.gs, '2023-01-01') AS norm1day
FROM (SELECT generate_series('2023-11-03 00:00:00Z'::timestamptz, 
'2023-11-06 00:00:00Z'::timestamptz, '5 minutes'::interval) AS gs) AS sub;


For the WHERE clause also everything in UTC (the conversion of the 
parameters of "user time zone" takes place before).


Conversion of the results into the "user time zone" takes place in the 
client app.

--
regards, marian wendt

Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

2023-10-04 Thread Marian Wendt


Am 04.10.2023 um 16:11 schrieb Tom Lane:

Steve Crawford  writes:

On Tue, Oct 3, 2023 at 7:33 PM Lincoln Swaine-Moore
wrote:

5) Ideally, the solution would not involve messing with the
server/connection's value of timezone. (Though I would be interested if
there was a solution that relaxed this constraint and was relatively
safe/compatible with transactions and psycopg2.)

Note that setting the time zone is a client/connection setting so if you
set it within a transaction, it will stay set when the transaction
concludes. But time manipulation is tricky and trying to DIY reinvent the
wheel is painful and often buggy. Let PostgreSQL do the work for you.

Expanding on that philosophy: you should be able to set the timezone
locally within a function, so that it wouldn't be that hard to make a
wrapper for generate_series that emulates the 4-argument version added
in v16.

Rather than messing with manually saving and restoring the prevailing
zone, I'd let the function SET infrastructure do it for me.  Sadly,
that SET clause only takes literal constant arguments, so it'd go
roughly like this:

create function generate_series(timestamptz, timestamptz, interval, text)
returns setof timestamptz
strict immutable language plpgsql as
$$
begin
   perform set_config('timezone', $4, true);
   return query select generate_series($1, $2, $3);
end
$$ set timezone = 'UTC';

Setting the zone to UTC is a useless step, but that triggers
restoring the previous zone when the function exits; simpler
and probably faster than coding the save/restore explicitly.

Side note: whether this is really "immutable" is a matter for
debate, since time zone definitions tend to change over time.
But we chose to mark the new 4-argument version that way,
so you might as well do so too.

regards, tom lane




As far as Lincoln describes it, the series is not the real problem here, 
but is just intended to be a simplified example of his actual data.The 
consideration that you can use the time zone using a function should 
apply here...The following SELECT should show as an example that the 
desired result can be achieved (TIMEZONE set to 'Etc/UTC'). Variations 
of date_bin for 15 minutes or 1 hour should work similarly...


SELECT
    sub.gs AS ts_in_utc
    ,sub.gs AT TIME ZONE 'America/New_York' AS ts_in_local_tz
    ,date_bin('1 days', sub.gs AT TIME ZONE 'America/New_York', 
'2023-01-01')
FROM (SELECT generate_series('2023-11-03 00:00:00Z'::timestamptz, 
'2023-11-07 00:00:00Z'::timestamptz, '5 minutes'::interval) AS gs) AS sub

WHERE
    sub.gs >= '2023-11-05 00:00:00 America/New_York'::timestamptz AND 
sub.gs < '2023-11-06 00:00:00 America/New_York'::timestamptz


--
regards, marian wendt

Re: Index scan is not pushed down to union all subquery

2023-10-05 Thread Marian Wendt

Hi,

Is it intended that indexes are not pushed down to union all
subqueries if even a single select contains a where clause? Is this
just not implemented, is it impossible to implement or am I doing
something wrong?

The following query does a SeqScan for "bikes" and "cars" tables even
though IndexScan on their column "dealer_name" would be a magnitude
faster. (Schema with sample data at the bottom of this mail)

WITH
   targets as (
 select 'bike' vehicle, id, dealer_name FROM bikes WHERE frame_size = 52
 union all
 select 'car' vehicle, id, dealer_name FROM cars
 -- In the real use case I have here dozens of tables
   )
SELECT
   dealers.name dealer,
   targets.vehicle,
   targets.id
FROM
   dealers
   JOIN targets
 ON dealers.name = targets.dealer_name
   WHERE dealers.id in (54,12,456,315,468)


If the WHERE clause from the "bikes" subquery is removed then first
Index Scan on dealers_pk is made and then Bitmap Index Scans on
"bikes" and "cars" table using indexes on "dealer_name" columns.


---
  -- Available also at:https://www.db-fiddle.com/f/cEXt8HXSaQzsQ2yBDA2Z4H/7
CREATE TABLE dealers AS
SELECT
   id,
   (SELECT string_agg(CHR(65+(random() * 25)::integer), '') FROM
generate_series(1, 4) WHERE id>0) name
FROM generate_series(1, 1000) AS id
;
ALTER TABLE dealers ADD primary key (id);
CREATE INDEX ON dealers(name);


CREATE TABLE bikes AS
SELECT
   generate_series AS id,
   (SELECT name FROM dealers WHERE dealers.id = (SELECT
(random()*1000)::int WHERE generate_series>0)) AS dealer_name,
   (random()*12+50)::int as frame_size
FROM generate_series(1, 10);
ALTER TABLE bikes ADD primary key (id);
CREATE INDEX ON bikes(dealer_name);


CREATE TABLE cars AS
SELECT
   generate_series as id,
   (SELECT name FROM dealers WHERE dealers.id = (SELECT
(random()*1000)::int WHERE generate_series>0)) AS dealer_name,
   (random()*7+14)::int as wheel_size
FROM generate_series(1, 10);
ALTER TABLE cars ADD primary key (id);
CREATE INDEX ON cars(dealer_name);


ANALYZE;


--
- Lauri




With an INNER JOIN, both tables must be fully checked/matched (check 
using EXPLAIN ANALYSIS -> HashJoin), so the index cannot be used here.

--
regards, marian wendt

Re: Index scan is not pushed down to union all subquery

2023-10-05 Thread Marian Wendt

Hi,

Is it intended that indexes are not pushed down to union all
subqueries if even a single select contains a where clause? Is this
just not implemented, is it impossible to implement or am I doing
something wrong?

The following query does a SeqScan for "bikes" and "cars" tables even
though IndexScan on their column "dealer_name" would be a magnitude
faster. (Schema with sample data at the bottom of this mail)

WITH
   targets as (
 select 'bike' vehicle, id, dealer_name FROM bikes WHERE frame_size = 52
 union all
 select 'car' vehicle, id, dealer_name FROM cars
 -- In the real use case I have here dozens of tables
   )
SELECT
   dealers.name dealer,
   targets.vehicle,
   targets.id
FROM
   dealers
   JOIN targets
 ON dealers.name = targets.dealer_name
   WHERE dealers.id in (54,12,456,315,468)


If the WHERE clause from the "bikes" subquery is removed then first
Index Scan on dealers_pk is made and then Bitmap Index Scans on
"bikes" and "cars" table using indexes on "dealer_name" columns.


---
  -- Available also at:https://www.db-fiddle.com/f/cEXt8HXSaQzsQ2yBDA2Z4H/7
CREATE TABLE dealers AS
SELECT
   id,
   (SELECT string_agg(CHR(65+(random() * 25)::integer), '') FROM
generate_series(1, 4) WHERE id>0) name
FROM generate_series(1, 1000) AS id
;
ALTER TABLE dealers ADD primary key (id);
CREATE INDEX ON dealers(name);


CREATE TABLE bikes AS
SELECT
   generate_series AS id,
   (SELECT name FROM dealers WHERE dealers.id = (SELECT
(random()*1000)::int WHERE generate_series>0)) AS dealer_name,
   (random()*12+50)::int as frame_size
FROM generate_series(1, 10);
ALTER TABLE bikes ADD primary key (id);
CREATE INDEX ON bikes(dealer_name);


CREATE TABLE cars AS
SELECT
   generate_series as id,
   (SELECT name FROM dealers WHERE dealers.id = (SELECT
(random()*1000)::int WHERE generate_series>0)) AS dealer_name,
   (random()*7+14)::int as wheel_size
FROM generate_series(1, 10);
ALTER TABLE cars ADD primary key (id);
CREATE INDEX ON cars(dealer_name);


ANALYZE;


--
- Lauri




With an INNER JOIN, both tables must be fully checked/matched (check 
using EXPLAIN ANALYSIS -> HashJoin), so the index cannot be used here.
Sorry, didn't consider the WITH part. Please share the detailed query 
plan for more info.

--
regards, marian wendt

Re: Index scan is not pushed down to union all subquery

2023-10-05 Thread Marian Wendt

Hi,

Is it intended that indexes are not pushed down to union all
subqueries if even a single select contains a where clause? Is this
just not implemented, is it impossible to implement or am I doing
something wrong?

The following query does a SeqScan for "bikes" and "cars" tables even
though IndexScan on their column "dealer_name" would be a magnitude
faster. (Schema with sample data at the bottom of this mail)

WITH
   targets as (
 select 'bike' vehicle, id, dealer_name FROM bikes WHERE frame_size = 52
 union all
 select 'car' vehicle, id, dealer_name FROM cars
 -- In the real use case I have here dozens of tables
   )
SELECT
   dealers.name dealer,
   targets.vehicle,
   targets.id
FROM
   dealers
   JOIN targets
 ON dealers.name = targets.dealer_name
   WHERE dealers.id in (54,12,456,315,468)


If the WHERE clause from the "bikes" subquery is removed then first
Index Scan on dealers_pk is made and then Bitmap Index Scans on
"bikes" and "cars" table using indexes on "dealer_name" columns.


---
  -- Available also at:https://www.db-fiddle.com/f/cEXt8HXSaQzsQ2yBDA2Z4H/7
CREATE TABLE dealers AS
SELECT
   id,
   (SELECT string_agg(CHR(65+(random() * 25)::integer), '') FROM
generate_series(1, 4) WHERE id>0) name
FROM generate_series(1, 1000) AS id
;
ALTER TABLE dealers ADD primary key (id);
CREATE INDEX ON dealers(name);


CREATE TABLE bikes AS
SELECT
   generate_series AS id,
   (SELECT name FROM dealers WHERE dealers.id = (SELECT
(random()*1000)::int WHERE generate_series>0)) AS dealer_name,
   (random()*12+50)::int as frame_size
FROM generate_series(1, 10);
ALTER TABLE bikes ADD primary key (id);
CREATE INDEX ON bikes(dealer_name);


CREATE TABLE cars AS
SELECT
   generate_series as id,
   (SELECT name FROM dealers WHERE dealers.id = (SELECT
(random()*1000)::int WHERE generate_series>0)) AS dealer_name,
   (random()*7+14)::int as wheel_size
FROM generate_series(1, 10);
ALTER TABLE cars ADD primary key (id);
CREATE INDEX ON cars(dealer_name);


ANALYZE;


--
- Lauri




With an INNER JOIN, both tables must be fully checked/matched (check 
using EXPLAIN ANALYSIS -> HashJoin), so the index cannot be used here.
Sorry, didn't consider the WITH part. Please share the detailed query 
plan for more info.
The "bikes" subquery uses field "frame_size" in WHERE clause but the 
field does not have an index...
ADD: Consider whether it might make sense to take a more generalist 
approach by only having one entity vehicle with the distinction "car", 
"bike", etc...?
ADD: Consider to do more complex "detailed" SELECTs that are unioned (if 
that is really needed)?

--
regards, marian wendt