Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

2023-10-03 Thread Lincoln Swaine-Moore
nt, though I'd be curious to hear if there are discrepancies.

On this point, I've read up on some of the history around this feature, and
was a little puzzled by this assertion in this thread:
https://www.postgresql.org/message-id/87in1k73nr@news-spur.riddles.org.uk
:

> If you have a timestamp-in-UTC column and want to do a date_trunc in some
other specified zone (that's not the session timezone), you need FOUR uses
of AT TIME ZONE to do it correctly:
> date_trunc('day', col AT TIME ZONE 'UTC' AT TIME ZONE 'Asia/Kathmandu')
AT TIME ZONE 'Asia/Kathmandu' AT TIME ZONE 'UTC'

I *think* that this doesn't comport with behavior I've seen, because at a
DST boundary the initial `col AT TIME ZONE 'UTC' AT TIME ZONE
'Asia/Kathmandu'` would drop the offset, which from then on would be
unrecoverable. For example:

```
select date_trunc('hour', '2023-11-05 05:30:00.00'::timestamp AT TIME
ZONE 'UTC' AT TIME ZONE 'America/New_York') AT TIME ZONE 'America/New_York'
AT TIME ZONE 'UTC';
  timezone
-
 2023-11-05 06:00:00

select date_trunc('hour', '2023-11-05 06:30:00.00'::timestamp AT TIME
ZONE 'UTC' AT TIME ZONE 'America/New_York') AT TIME ZONE 'America/New_York'
AT TIME ZONE 'UTC';
  timezone
-
 2023-11-05 06:00:00
```

This is what I meant above by "clumping" in point (3). I believe this issue
is ameliorated by setting the database timezone:

```
set timezone to 'America/New_York';
select date_trunc('hour', '2023-11-05 06:30:00.00'::timestamp AT TIME
ZONE 'UTC');
   date_trunc

 2023-11-05 01:00:00-05

select date_trunc('hour', '2023-11-05 05:30:00.00'::timestamp AT TIME
ZONE 'UTC');
   date_trunc

 2023-11-05 01:00:00-04
```

The same (correct, in my opinion) behavior seems to occur when using the
third argument to date_trunc, so all is well on that front, but I'm being
pedantic about this because if I was misunderstanding and usage of four `at
time zone`s was a legitimate strategy, it could plausibly be applied to the
below. Hoping to be incorrect here!


2) date_bin: this will take a "stride", which should accommodate all the
periods I'm interested in, and "origin" to help deal with initial day/hour
offsets. But I'm having trouble getting it to respect time zone/DST stuff.
For example:

```
[etc]
select
date_bin('24 hours', t, '2023-11-03 00:00:00.00-0400'::timestamptz),
min(t),
max(t),
count(*)
from
original_data
group by 1
order by 1;

date_bin|  min   |  max   |
count
+++---
 2023-11-04 00:00:00-04 | 2023-11-04 20:00:00-04 | 2023-11-04 23:45:00-04 |
   16
 2023-11-05 00:00:00-04 | 2023-11-05 00:00:00-04 | 2023-11-05 22:45:00-05 |
   96
 2023-11-05 23:00:00-05 | 2023-11-05 23:00:00-05 | 2023-11-06 22:45:00-05 |
   96
 2023-11-06 23:00:00-05 | 2023-11-06 23:00:00-05 | 2023-11-07 22:45:00-05 |
   96
 2023-11-07 23:00:00-05 | 2023-11-07 23:00:00-05 | 2023-11-08 22:45:00-05 |
   96
 2023-11-08 23:00:00-05 | 2023-11-08 23:00:00-05 | 2023-11-09 22:45:00-05 |
   96
 2023-11-09 23:00:00-05 | 2023-11-09 23:00:00-05 | 2023-11-10 22:45:00-05 |
   96
[etc]
```

As you can see, the days get "bumped" by DST, and are off by one (until the
spring). I actually think this makes intuitive sense for this concept of
"stride"--it just means that it's not as useful for timezones with DST.


3) generate_series: If I could get this working, it would be ideal, because
it would also help fill in gaps in my data with null rows. But again, I
can't seem to, except in v16, have it respect timezones other than the
server's:

```
set timezone to 'UTC';
select
days as start_time,
lead(days) over (order BY days) as end_time
from generate_series(
'2023-11-03 00:00:00.00-0400'::timestamptz,
'2023-11-07 00:00:00.00-0500'::timestamptz,
'1 day'::interval
) days;

   start_time   |end_time
+
 2023-11-03 04:00:00+00 | 2023-11-04 04:00:00+00
 2023-11-04 04:00:00+00 | 2023-11-05 04:00:00+00
 2023-11-05 04:00:00+00 | 2023-11-06 04:00:00+00
 2023-11-06 04:00:00+00 | 2023-11-07 04:00:00+00
[etc.]


set timezone to 'America/New_York';
[same as above]

   start_time   |end_time
----+----
 2023-11-03 00:00:00-04 | 2023-11-04 00:00:00-04
 2023-11-04 00:00:00-04 | 2023-11-05 00:00:00-04
 2023-11-05 00:00:00-04 | 2023-11-06 00:00:00-05
 2023-11-06 00:00:00-05 | 2023-11-07 00:00:00-05
[etc.]
```

The latter being correct for these purposes, but not seeming super
reliable/practical (see: point 5).

Can anyone think of any other approaches to this problem? I'd be thrilled
if I could manage it without resorting to manually post-processing in
pandas or something, because I have a number of datapoints, and each
individual one is quite wide.

Apologies for the length of this message--just trying to be thorough. I
sincerely appreciate any help or pointers!

Best,
Lincoln


-- 
Lincoln Swaine-Moore


Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

2023-10-04 Thread Lincoln Swaine-Moore
Thank you to all who have weighed in! Very much appreciated.

A few thoughts based on what I've read:

> As far as Lincoln describes it, the series is not the real problem here,
but is just intended to be a simplified example of his actual data.

This both is and isn't the case. I was using gneerate_series to create some
data for testing purposes, but I also would love to be able to use
generate_series for the logic as well.

> SELECT
> sub.gs AS ts_in_utc
> ,sub.gs AT TIME ZONE 'America/New_York' AS ts_in_local_tz
> ,date_bin('1 days', sub.gs AT TIME ZONE 'America/New_York',
> '2023-01-01')
> FROM (SELECT generate_series('2023-11-03 00:00:00Z'::timestamptz,
> '2023-11-07 00:00:00Z'::timestamptz, '5 minutes'::interval) AS gs) AS sub
> WHERE
> sub.gs >= '2023-11-05 00:00:00 America/New_York'::timestamptz AND
> sub.gs < '2023-11-06 00:00:00 America/New_York'::timestamptz

I believe this query will be funky around DST borders, because `sub.gs AT
TIME ZONE 'America/New_York'` will be localized in a way that erases the
difference between hours with different offsets, which are genuinely
different. For instance, I ran this and there are two rows within it that
look like:

` 2023-11-05 05:00:00+00 | 2023-11-05 01:00:00 | 2023-11-05 00:00:00`
and
` 2023-11-05 06:00:00+00 | 2023-11-05 01:00:00 | 2023-11-05 00:00:00`

I think that the non-unique second column will pose an issue for the date
binning at a resolution finer than 1 day.

> Note that setting the time zone is a client/connection setting so if you
> set it within a transaction, it will stay set when the transaction
> concludes. But time manipulation is tricky and trying to DIY reinvent the
> wheel is painful and often buggy. Let PostgreSQL do the work for you.

Yeah, I'm definitely interested in keeping as much of the DST stuff outside
my code as possible. I guess my concern is that I think there are other
places where this database is being used in my codebase that may rely on
the database time setting being UTC (or really, GMT, though I don't think
there's a difference for these purposes). It would be best if all of my
application's code declared its intentions about the time zone of the
database connection before running its query, but I don't think that's a
feasible change to make right now. That's what's motivating my pursuit of
finding a way to write these queries without changing this setting, through
appropriate casting and such.

> create function generate_series(timestamptz, timestamptz, interval, text)
> returns setof timestamptz
strict immutable language plpgsql as
$$
begin
perform set_config('timezone', $4, true);
return query select generate_series($1, $2, $3);
end
$$ set timezone = 'UTC';

This is a nice suggestion, and in fact, it would be fine from my
perspective to reset to UTC every time. My concern is only around the
safety of the final `set timezone`. Under what circumstances/when can I
count on that being set? E.g. if a query using that function was cancelled
before finishing, would the connection timezone remain as $4? I guess the
3rd parameter to set_config is `is_local` (based on
https://pgpedia.info/s/set_config.html). Does that mean I could run this
outside this context of a function, and expect the setting to go back to
UTC on a rollback? Apologies if these are naive questions.

Thanks again for all the help.

Best,
Lincoln


On Wed, Oct 4, 2023 at 11:09 AM Marian Wendt  wrote:

>
> Am 04.10.2023 um 16:11 schrieb Tom Lane:
>
> Steve Crawford  
>  writes:
>
> On Tue, Oct 3, 2023 at 7:33 PM Lincoln Swaine-Moore  
> 
> wrote:
>
> 5) Ideally, the solution would not involve messing with the
> server/connection's value of timezone. (Though I would be interested if
> there was a solution that relaxed this constraint and was relatively
> safe/compatible with transactions and psycopg2.)
>
> Note that setting the time zone is a client/connection setting so if you
> set it within a transaction, it will stay set when the transaction
> concludes. But time manipulation is tricky and trying to DIY reinvent the
> wheel is painful and often buggy. Let PostgreSQL do the work for you.
>
> Expanding on that philosophy: you should be able to set the timezone
> locally within a function, so that it wouldn't be that hard to make a
> wrapper for generate_series that emulates the 4-argument version added
> in v16.
>
> Rather than messing with manually saving and restoring the prevailing
> zone, I'd let the function SET infrastructure do it for me.  Sadly,
> that SET clause only takes literal constant arguments, so it'd go
> roughly like this:
>
> create function generate_series(timestamptz, timestamptz, 

Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

2023-10-04 Thread Lincoln Swaine-Moore
> No.  The function call mechanism will ensure that timezone goes back
> to its previous state at function exit.

> An actual rollback would undo the effects of set_config, yes.  You
> only need this function wrapper to ensure that subsequent operations
> in the same transaction don't see the setting change.

Excellent, thank you. So just to be explicit here, I could either run this
function, or set/run my query/set back, with the same behavior/safety
guarantees as if I was using the generate_series function with timezone
from v16?


On Wed, Oct 4, 2023 at 12:54 PM Tom Lane  wrote:

> Lincoln Swaine-Moore  writes:
> >>> create function generate_series(timestamptz, timestamptz, interval,
> text)
> >>> returns setof timestamptz
> >>> strict immutable language plpgsql as
> >>> $$
> >>> begin
> >>> perform set_config('timezone', $4, true);
> >>> return query select generate_series($1, $2, $3);
> >>> end
> >>> $$ set timezone = 'UTC';
>
> > This is a nice suggestion, and in fact, it would be fine from my
> > perspective to reset to UTC every time. My concern is only around the
> > safety of the final `set timezone`. Under what circumstances/when can I
> > count on that being set? E.g. if a query using that function was
> cancelled
> > before finishing, would the connection timezone remain as $4?
>
> No.  The function call mechanism will ensure that timezone goes back
> to its previous state at function exit.  (In the case of an error
> exit, that's actually handled by the transaction abort logic, but the
> result is the same.)  Because of that, I think it doesn't really
> matter whether the set_config call says "true" or "false", but saying
> that it's a local setting seems less confusing.
>
> > Does that mean I could run this
> > outside this context of a function, and expect the setting to go back to
> > UTC on a rollback?
>
> An actual rollback would undo the effects of set_config, yes.  You
> only need this function wrapper to ensure that subsequent operations
> in the same transaction don't see the setting change.
>
> regards, tom lane
>


-- 
Lincoln Swaine-Moore


Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

2023-10-04 Thread Lincoln Swaine-Moore
> Yeah.  One thing to keep in mind is that that might have different
> behavior in terms of the evaluation of the arguments to the function,
> ie which timezone setting is your input parsed according to.

I see. You mean, in the event that it doesn't conform to an entry in
`pg_timezone_names`? I do have control over the possible options the user
can provide, so it should be straightforward to make sure those all have
entries.

Thanks,
Lincoln


Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

2023-10-04 Thread Lincoln Swaine-Moore
> If you mean that your input will always include an explicit zone
> specification, then this doesn't affect you.  What I was thinking
> about was that
>
> select generate_series('2023-10-04 13:30', ...)

> is going to mean different things depending on the zone setting
> that prevails when that constant is parsed.

Gotcha--I thought you meant the timezone argument to the function ($4). I
can make sure that all the datetime arguments to the function are "with
timezone", so there should be no ambiguity. But I guess if I didn't, the
timezone given by $4 would be the one parsing the naive timestamps, which I
think would also be ok from my perspective.

Thanks!


Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

2023-10-04 Thread Lincoln Swaine-Moore
> Really UTC (not that it matters for calculation purposes)

Sorry, yes--I just meant that literally when I run `show timezone` on the
server in question, I get `GMT`.

> As to "other places" in your (or others) code, provided that you are
storing and manipulating your data as timestamp WITH time zone (point in
time)

Unfortunately this is not the case for my codebase/schema. The data is
stored WITHOUT. I glossed over this fact in my post, because they are in
fact UTC times that have had their (0) offset truncated, and prior to my
usage of any solution I can use `at time zone 'UTC'` to properly handle
them. But I don't think it's the case that in the application more
generally this is being handled gracefully, and so the behavior that is
currently (correctly) running is being propped up by the database's time
zone setting. In the long term, I think it might be appropriate to create a
new column that is simply `t at time zone 'UTC'`, and use that everywhere,
but that's not feasible for the task I'm handling right now.

Thanks!


Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

2023-10-04 Thread Lincoln Swaine-Moore
> What I do in such cases is to add an extra column with the UTC timestamp
to serve as a linear scale to the local timestamps. That also helps with
ordering buckets in reports and such during DST changes (especially the
ones where an hour repeats).

> For hours and quarter hours I found it to be fairly convenient to base a
view on a join between a date calendar and an (quarter of an) hour per UTC
day table, but materialising that with some indexes may perform better (at
the cost of disk space). I do materialise that currently, but our database
server doesn’t have a lot of memory so I’m often not hitting the cache and
performance suffers a bit (infrastructure is about to change for the better
though).

That's an interesting idea, but I'm not sure I fully understand. Assuming
you're aggregating data: what do you group by? For instance, at an hourly
resolution, if you group by both the UTC timestamp and the local one, you
might end up, say, dividing an hour-long bucket in two for time zones with
half-hour-based offsets, no?

Thanks for the detailed writeup! Definitely helpful to learn more about
what people are using in production to handle this sort of thing.

-- 
Lincoln Swaine-Moore