andygrove commented on issue #2649: URL: https://github.com/apache/datafusion-comet/issues/2649#issuecomment-4173434302
## Findings from PR #3877

I attempted a fix in #3877 that addressed both root causes identified in this issue. Here's a summary of the approach and the remaining issues for whoever picks this up next.

### Two bugs identified

**Bug 1 — DST offset preservation during truncation:** The `trunc_date_to_*` functions in `native/spark-expr/src/kernels/temporal.rs` modify date/time fields on a `DateTime<Tz>` while preserving the original UTC offset. When a DST transition falls between the input timestamp and the truncated result (e.g., truncating a November timestamp in `America/Denver` to QUARTER lands on October 1, which is still MDT, not MST), the preserved offset is wrong by one hour.

**Fix approach:** Extract `naive_local()` from the input `DateTime<Tz>`, perform the truncation on the `NaiveDateTime`, then re-apply the timezone via `tz.from_local_datetime()` to get a fresh DST lookup — matching Java's `ZonedDateTime.of()` behavior. The new function `as_micros_from_naive_datetime_tz` replaces `as_micros_from_unix_epoch_utc` and handles:

- `LocalResult::Single` — straightforward
- `LocalResult::Ambiguous` (fall-back overlap) — uses the earlier instant, matching Java
- `LocalResult::None` (spring-forward gap) — advances by one hour, which assumes all DST gaps are exactly one hour

**Bug 2 — Missing timezone annotation on output:** `TimestampTruncExpr::data_type()` in `native/spark-expr/src/datetime_funcs/timestamp_trunc.rs` returned `Timestamp(Microsecond, None)` instead of `Timestamp(Microsecond, Some(tz))`, causing schema mismatches such as `RowConverter column schema mismatch` during shuffle/sort. Fix: propagate the expression's timezone into the return type, and annotate the output arrays from `as_timestamp_tz_with_op` with `.with_timezone()`.

### Non-1-hour DST gap limitation

The `LocalResult::None` handling assumes 1-hour DST gaps. Rare timezones such as `Australia/Lord_Howe` (30-minute DST transition) would produce incorrect results.
The PR added a Scala-side `hasNonHourDst()` check using `ZoneId.getRules().getTransitionRules()` to mark those timezones as `Incompatible`, falling back to Spark. This replaced the blanket non-UTC incompatibility.

### What's still failing

The Rust unit tests for the core truncation logic pass (including DST boundary and far-future date tests). The JVM integration tests in `CometTemporalExpressionSuite` (which compare Comet output against Spark across `UTC`, `America/Los_Angeles`, and `Europe/London`) are still failing. I didn't fully diagnose the remaining JVM failures before closing the PR.

### Key files changed

- `native/spark-expr/src/kernels/temporal.rs` — core truncation DST fix
- `native/spark-expr/src/datetime_funcs/timestamp_trunc.rs` — `data_type()` timezone annotation fix
- `spark/src/main/scala/org/apache/comet/serde/datetime.scala` — `hasNonHourDst()` gating logic
