andygrove commented on issue #2649:
URL: https://github.com/apache/datafusion-comet/issues/2649#issuecomment-4173434302

   ## Findings from PR #3877
   
   I attempted a fix in #3877 that targeted both root causes identified in this 
issue. Here's a summary of the approach and the remaining problems for whoever 
picks this up next.
   
   ### Two bugs identified
   
   **Bug 1 — DST offset preservation during truncation:** The `trunc_date_to_*` 
functions in `native/spark-expr/src/kernels/temporal.rs` modify date/time 
fields on a `DateTime<Tz>` while preserving the input's UTC offset. When a DST 
transition falls between the input timestamp and the truncated result (e.g., 
truncating a November timestamp in `America/Denver` to QUARTER lands on October 
1, which is still in MDT, not MST), the preserved offset is off by one hour.
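   For reference, `java.time` gets this right: adjusting the local fields of a 
`ZonedDateTime` re-resolves the offset against the zone rules instead of 
carrying the old one forward. A minimal illustration of the Denver QUARTER case 
above (the specific 2023 dates are chosen for illustration):

   ```java
   import java.time.LocalDateTime;
   import java.time.ZoneId;
   import java.time.ZonedDateTime;
   import java.time.temporal.ChronoUnit;

   public class QuarterTruncOffset {
       public static void main(String[] args) {
           ZoneId denver = ZoneId.of("America/Denver");
           // November 15 is after the fall-back transition, so the offset is MST (-07:00).
           ZonedDateTime input =
                   ZonedDateTime.of(LocalDateTime.of(2023, 11, 15, 12, 0), denver);
           // Truncate to the start of the quarter (October 1) by adjusting local fields.
           ZonedDateTime quarter =
                   input.withMonth(10).withDayOfMonth(1).truncatedTo(ChronoUnit.DAYS);
           // java.time refreshes the offset: October 1 is still MDT (-06:00), not MST.
           System.out.println(input.getOffset());   // -07:00
           System.out.println(quarter.getOffset()); // -06:00
       }
   }
   ```

   Preserving the input's `-07:00` offset here, as the buggy kernel did, 
produces an instant one hour later than the correct MDT result.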
   
   **Fix approach:** Extract `naive_local()` from the input `DateTime<Tz>`, 
perform the truncation on the `NaiveDateTime`, then re-apply the timezone via 
`tz.from_local_datetime()` to get a fresh DST lookup — matching Java's 
`ZonedDateTime.of()` behavior. The new function 
`as_micros_from_naive_datetime_tz` replaces `as_micros_from_unix_epoch_utc` and 
handles all three `LocalResult` variants:
   - `LocalResult::Single` — the local time maps to exactly one instant; use it directly
   - `LocalResult::Ambiguous` (fall-back overlap) — use the earlier instant, matching Java
   - `LocalResult::None` (spring-forward gap) — advance by 1 hour, which 
assumes all DST gaps are exactly 1 hour
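   The two non-trivial `LocalResult` cases correspond directly to how 
`java.time` resolves nonexistent and ambiguous local times, which is the 
behavior the fix aims to match. A small Java sketch of both cases in 
`America/Denver` (2023 transition dates chosen for illustration):

   ```java
   import java.time.LocalDateTime;
   import java.time.ZoneId;
   import java.time.ZonedDateTime;

   public class DstResolution {
       public static void main(String[] args) {
           ZoneId denver = ZoneId.of("America/Denver");

           // Spring-forward gap: 2023-03-12 02:30 never exists (clocks jump 02:00 -> 03:00).
           // java.time advances by the length of the gap; the Rust fix hard-codes 1 hour.
           ZonedDateTime gap =
                   ZonedDateTime.of(LocalDateTime.of(2023, 3, 12, 2, 30), denver);
           System.out.println(gap); // 2023-03-12T03:30-06:00[America/Denver]

           // Fall-back overlap: 2023-11-05 01:30 occurs twice (once MDT, once MST).
           // java.time picks the earlier instant, i.e. the pre-transition offset.
           ZonedDateTime overlap =
                   ZonedDateTime.of(LocalDateTime.of(2023, 11, 5, 1, 30), denver);
           System.out.println(overlap.getOffset()); // -06:00
       }
   }
   ```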
   
   **Bug 2 — Missing timezone annotation on output:** 
`TimestampTruncExpr::data_type()` in 
`native/spark-expr/src/datetime_funcs/timestamp_trunc.rs` returned 
`Timestamp(Microsecond, None)` instead of `Timestamp(Microsecond, Some(tz))`, 
causing schema mismatches like `RowConverter column schema mismatch` during 
shuffle/sort. Fix: propagate the expression's timezone into the return type, 
and annotate the output arrays from `as_timestamp_tz_with_op` with 
`.with_timezone()`.
   
   ### Non-1-hour DST gap limitation
   
   The `LocalResult::None` handling assumes 1-hour DST gaps. Rare timezones 
like `Australia/Lord_Howe` (30-minute DST transition) would produce incorrect 
results. The PR added a Scala-side `hasNonHourDst()` check using 
`ZoneId.getRules().getTransitionRules()` to mark those timezones as 
`Incompatible` and fall back to Spark, replacing the previous blanket 
incompatibility for all non-UTC timezones.
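   The gating check can be sketched in Java (the PR's version is in Scala; the 
method name comes from the PR, the body below is an assumed reconstruction). 
Note that `getTransitionRules()` only exposes the recurring rules, not 
historical one-off transitions:

   ```java
   import java.time.ZoneId;
   import java.time.zone.ZoneOffsetTransitionRule;

   public class DstGating {
       // Assumed reconstruction of the PR's Scala-side hasNonHourDst() check:
       // flag any zone whose recurring DST transitions shift the offset by
       // something other than exactly one hour.
       static boolean hasNonHourDst(ZoneId zone) {
           for (ZoneOffsetTransitionRule rule : zone.getRules().getTransitionRules()) {
               int shiftSeconds = Math.abs(rule.getOffsetAfter().getTotalSeconds()
                       - rule.getOffsetBefore().getTotalSeconds());
               if (shiftSeconds != 0 && shiftSeconds != 3600) {
                   return true;
               }
           }
           return false;
       }

       public static void main(String[] args) {
           // Lord Howe Island shifts by 30 minutes; Denver by the usual hour.
           System.out.println(hasNonHourDst(ZoneId.of("Australia/Lord_Howe"))); // true
           System.out.println(hasNonHourDst(ZoneId.of("America/Denver")));      // false
       }
   }
   ```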
   
   ### What's still failing
   
   The Rust unit tests for the core truncation logic pass (including DST 
boundary and far-future date tests). The JVM integration tests in 
`CometTemporalExpressionSuite` (which compare Comet output against Spark across 
`UTC`, `America/Los_Angeles`, and `Europe/London`) are still failing. I didn't 
fully diagnose the remaining JVM failures before closing the PR.
   
   ### Key files changed
   
   - `native/spark-expr/src/kernels/temporal.rs` — core truncation DST fix
   - `native/spark-expr/src/datetime_funcs/timestamp_trunc.rs` — `data_type()` 
timezone annotation fix
   - `spark/src/main/scala/org/apache/comet/serde/datetime.scala` — 
`hasNonHourDst()` gating logic


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

