andygrove opened a new pull request, #3877:
URL: https://github.com/apache/datafusion-comet/pull/3877
## Which issue does this PR close?
Closes #2649.
## Rationale for this change
`date_trunc` (TruncTimestamp) was producing wrong results for non-UTC
timezones due to two bugs:
1. **DST offset bug**: The truncation functions (`trunc_date_to_year`,
`trunc_date_to_quarter`, etc.) modify date/time fields on a `DateTime<Tz>`
while preserving the *original* UTC offset. When a DST transition falls between
the input timestamp and the truncated result (e.g. truncating a November
timestamp to the start of Q4 in October in America/Denver), the offset at the
truncated date is different, causing results off by 1 hour.
2. **Missing timezone annotation**: `TimestampTruncExpr::data_type()`
returned `Timestamp(Microsecond, None)` instead of `Timestamp(Microsecond,
Some(tz))`, causing schema mismatches like `RowConverter column schema
mismatch, expected Timestamp(Microsecond, Some("America/Denver")) got
Timestamp(Microsecond, Some("UTC"))` during shuffle/sort.
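The offset mismatch in bug 1 can be reproduced with plain `java.time` (a minimal sketch of the semantics, not the Comet code itself):

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstOffsetDemo {
    public static void main(String[] args) {
        ZoneId tz = ZoneId.of("America/Denver");
        // A November input: DST ended Nov 5, 2023, so Denver is on MST (UTC-7).
        ZonedDateTime input = ZonedDateTime.of(2023, 11, 15, 12, 0, 0, 0, tz);
        // Truncating to the quarter rebuilds the local fields; ZonedDateTime.of
        // performs a fresh zone lookup, and Oct 1 is still on MDT (UTC-6).
        ZonedDateTime truncated = ZonedDateTime.of(2023, 10, 1, 0, 0, 0, 0, tz);
        System.out.println(input.getOffset());      // -07:00 (MST)
        System.out.println(truncated.getOffset());  // -06:00 (MDT)
    }
}
```

Reusing the input's `-07:00` offset for the truncated local time, as the old kernel effectively did, shifts the resulting instant by one hour.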
## What changes are included in this PR?
**`native/spark-expr/src/kernels/temporal.rs`**:
- Replaced `as_micros_from_unix_epoch_utc` with
`as_micros_from_local_datetime_tz` which, after truncation, extracts the naive
local datetime and re-applies the timezone via `tz.from_local_datetime()` —
performing a fresh DST lookup matching Java's `ZonedDateTime.of()` behavior.
- Updated `as_timestamp_tz_with_op` and `as_timestamp_tz_with_op_single` to
pass `&Tz` to their closures and return timezone-annotated output arrays.
- Updated the `timestamp_trunc_array_fmt_helper!` macro the same way.
- Timezones with non-1-hour DST transitions (e.g. `Australia/Lord_Howe` at
30 min) are marked `Incompatible` at the Scala planning layer and fall back to
Spark, keeping the Rust `LocalResult::None` handling simple.
**`native/spark-expr/src/datetime_funcs/timestamp_trunc.rs`**:
- Fixed `data_type()` to return `Timestamp(Microsecond, Some(tz))` with the
expression's timezone.
**`spark/src/main/scala/org/apache/comet/serde/datetime.scala`**:
- Removed the blanket `Incompatible` guard for all non-UTC timezones.
- Added `hasNonHourDst()` using `ZoneId.getRules().getTransitionRules()` to
fall back to Spark only for the rare timezones with non-1-hour DST transitions.
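The transition-rule check can be sketched with the same `java.time.zone` API the Scala helper relies on. This is a hypothetical Java rendering of the logic; the actual `hasNonHourDst()` in `datetime.scala` may differ in detail:

```java
import java.time.ZoneId;
import java.time.zone.ZoneOffsetTransitionRule;

public class NonHourDstCheck {
    // Flag a zone when any recurring DST rule shifts the clock by
    // something other than a whole number of hours.
    static boolean hasNonHourDst(String zoneId) {
        for (ZoneOffsetTransitionRule rule :
                ZoneId.of(zoneId).getRules().getTransitionRules()) {
            int shift = Math.abs(rule.getOffsetAfter().getTotalSeconds()
                    - rule.getOffsetBefore().getTotalSeconds());
            if (shift != 0 && shift % 3600 != 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasNonHourDst("Australia/Lord_Howe")); // true: 30-min shift
        System.out.println(hasNonHourDst("America/Denver"));      // false: 1-hour shift
    }
}
```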
**`spark/src/test/scala/org/apache/comet/CometTemporalExpressionSuite.scala`**:
- Removed the UTC-only restriction; tests now run with `UTC`,
`America/Los_Angeles`, and `Europe/London`.
## How are these changes tested?
- New Rust unit test `test_timestamp_trunc_dst_boundary` verifies a concrete
DST boundary case: a November timestamp in `America/Denver` (MST, UTC-7)
truncated to `QUARTER` lands on `2023-10-01 00:00:00 MDT` (UTC-6), not
`00:00:00 MST` (UTC-7). The old code would return a result 1 hour later than
correct.
- Existing JVM integration tests in `CometTemporalExpressionSuite` now run
against multiple non-UTC timezones (`America/Los_Angeles`, `Europe/London`) in
addition to UTC, comparing Comet output against Spark for all supported
`date_trunc` formats.
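The one-hour error the Rust unit test asserts on can be checked independently in `java.time` (a sketch of the boundary case, not the test itself):

```java
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class TruncDstBoundary {
    public static void main(String[] args) {
        ZoneId tz = ZoneId.of("America/Denver");
        // Correct truncation: a fresh lookup picks the MDT offset (UTC-6)
        // in effect on Oct 1.
        ZonedDateTime correct = ZonedDateTime.of(2023, 10, 1, 0, 0, 0, 0, tz);
        // Buggy truncation: the November input's stale MST offset (UTC-7)
        // is kept on the truncated local time.
        OffsetDateTime buggy =
                OffsetDateTime.of(2023, 10, 1, 0, 0, 0, 0, ZoneOffset.ofHours(-7));
        // The stale offset makes the buggy instant one hour later.
        System.out.println(buggy.toEpochSecond() - correct.toEpochSecond()); // 3600
    }
}
```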