zclllyybb commented on issue #63640: URL: https://github.com/apache/doris/issues/63640#issuecomment-4535781333
Initial triage: This looks like a real 4.0.x external-reader DATE decoding bug, not an Iceberg snapshot mismatch.

The issue shows every Iceberg `DATE` value shifted exactly one day earlier while the `TIMESTAMP` column stays aligned with Spark. In the 4.0.5 and 4.0.2-rc02 code path, Iceberg Parquet files are scanned through `IcebergParquetReader`, which delegates the physical DATE conversion to the shared Parquet converter. That converter maps Parquet logical `DATE` to Doris `DATEV2`, but then applies a session-timezone-derived `offset_days` when converting the stored epoch-day integer:

- `be/src/vec/exec/scan/file_scanner.cpp`: Iceberg Parquet ranges are wrapped by `IcebergParquetReader`.
- `be/src/vec/exec/format/table/iceberg_reader.cpp`: `IcebergParquetReader` calls the shared `ParquetReader`.
- `be/src/vec/exec/format/parquet/schema_desc.cpp`: Parquet logical `DATE` is mapped to `TYPE_DATEV2`.
- `be/src/vec/exec/format/parquet/parquet_column_convert.h`: `ConvertParams::init()` derives `offset_days` from `from_unixtime(0, session_timezone)`, and `Int32ToDate` adds that offset to the Parquet DATE day count.

For a west-of-UTC session timezone, `from_unixtime(0, tz)` is `1969-12-31 ...`, so `offset_days = -1`; a stored Iceberg DATE day count of `0` (`1970-01-01`) is then decoded as day `-1` (`1969-12-31`). That matches the reported output exactly. DATE is a logical calendar day and should not depend on the query/session timezone.

There is already a closely related upstream fix on master / 4.1: `#61722` (`[fix](hive) Fix Hive DATE timezone shift in external readers`). Although the title says Hive, the Parquet part removes the same shared `offset_days` adjustment from `be/src/format/parquet/parquet_column_convert.h`, so the same principle should apply to Iceberg Parquet DATE reads. I checked locally that this fix is not an ancestor of the reported `4.0.5` or `4.0.2-rc02` tags.

Recommended next steps:

1. Backport/apply the Parquet DATE part of `#61722` to the 4.0 branch, and confirm it covers Iceberg as well as Hive because both use the shared Parquet physical-to-logical DATE converter.
2. Add an Iceberg regression case for DATE reads under at least two Doris time zones, e.g. `UTC` and a west timezone such as `America/Mexico_City` or `-06:00`, using the repro rows from this issue.
3. Ask the reporter to confirm `select @@time_zone;`, the Iceberg data file format (`parquet` vs `orc`), and whether `set time_zone = 'UTC'` makes Doris return the Spark dates. This is not needed to see the code bug, but it will confirm the exact runtime trigger in their deployment.

No code was changed in this triage.

Breakwater-GitHub-Analysis-Slot: slot_aa7376560be6
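
For reference, the `offset_days` arithmetic described in this triage can be sketched in isolation. This is a minimal standalone illustration, not the Doris implementation: `offset_days_for`, `format_epoch_day`, `decode_with_session_offset`, and `decode_timezone_independent` are hypothetical names, and the session timezone is modelled as a fixed UTC offset in seconds (an assumption; real time zones also carry DST rules).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical model of the described behaviour: the calendar day of
// from_unixtime(0, tz) relative to 1970-01-01, for a fixed UTC offset.
// Floor division makes any negative offset land on the previous day.
inline int32_t offset_days_for(int32_t utc_offset_seconds) {
    const int32_t kSecondsPerDay = 86400;
    int32_t q = utc_offset_seconds / kSecondsPerDay;
    if (utc_offset_seconds % kSecondsPerDay < 0) --q;
    return q;
}

// Convert days since 1970-01-01 to "YYYY-MM-DD" in the proleptic Gregorian
// calendar (the well-known days-to-civil algorithm).
inline std::string format_epoch_day(int64_t z) {
    z += 719468;
    const int64_t era = (z >= 0 ? z : z - 146096) / 146097;
    const int64_t doe = z - era * 146097;
    const int64_t yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
    int64_t y = yoe + era * 400;
    const int64_t doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    const int64_t mp = (5 * doy + 2) / 153;
    const int64_t d = doy - (153 * mp + 2) / 5 + 1;
    const int64_t m = mp < 10 ? mp + 3 : mp - 9;
    if (m <= 2) ++y;
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%04lld-%02lld-%02lld",
                  static_cast<long long>(y), static_cast<long long>(m),
                  static_cast<long long>(d));
    return std::string(buf);
}

// Buggy shape: the stored Parquet DATE day count is shifted by a
// session-timezone-derived offset before being turned into a calendar date.
inline std::string decode_with_session_offset(int32_t parquet_days,
                                              int32_t utc_offset_seconds) {
    return format_epoch_day(static_cast<int64_t>(parquet_days) +
                            offset_days_for(utc_offset_seconds));
}

// Expected shape: DATE is a logical calendar day, so the stored day count
// is used as-is, independent of any session timezone.
inline std::string decode_timezone_independent(int32_t parquet_days) {
    return format_epoch_day(parquet_days);
}
```

Under this model, a `-06:00` session gives `offset_days_for(-6 * 3600) == -1`, so a stored day count of `0` decodes as `1969-12-31` with the offset applied and as `1970-01-01` without it, which is the one-day shift reported in the issue.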
