enrico-stauss opened a new issue, #3766:
URL: https://github.com/apache/arrow-adbc/issues/3766

   ### What happened?
   
   # Bug Report
   
   ## Summary
   
   When ingesting timezone-aware Arrow timestamps (e.g., `timestamp[us, 
tz=UTC]`) via the Snowflake ADBC driver, tables are automatically created with 
`TIMESTAMP_LTZ` columns instead of `TIMESTAMP_TZ` columns. This loses the 
explicit timezone information present in the Arrow data and causes temporal 
misalignment issues in multi-source pipelines.
   
   ## Related Issues
   
   This is related to #2843, but specifically addresses **timezone-aware** 
timestamps where Arrow explicitly provides timezone metadata.
   
   ## What happened?
   
   The Snowflake ADBC driver maps timezone-aware Arrow timestamps to 
`TIMESTAMP_LTZ` during table creation:
   ```
   Arrow: timestamp[us, tz=UTC] → Snowflake: TIMESTAMP_LTZ
   ```
   
   This is semantically incorrect because:
   - **`TIMESTAMP_LTZ`** stores UTC values but interprets them relative to the **session timezone**
   - **`TIMESTAMP_TZ`** stores UTC values **together with an explicit time zone offset**
   
   When Arrow data explicitly specifies a timezone, that information should be 
preserved in Snowflake's `TIMESTAMP_TZ` type.
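
   For reference, the timezone travels on the Arrow type itself, so the driver has this information at table-creation time. A minimal pyarrow check (independent of the driver):
   ```python
   import pyarrow as pa

   # The timezone is metadata on the timestamp type itself.
   aware = pa.timestamp("us", tz="UTC")
   naive = pa.timestamp("us")

   print(aware)     # timestamp[us, tz=UTC]
   print(naive)     # timestamp[us]
   print(aware.tz)  # UTC
   print(naive.tz)  # None
   ```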
   
   ## What did you expect to happen?
   
   Timezone-aware Arrow timestamps should map to `TIMESTAMP_TZ`:
   ```
   Arrow: timestamp[us, tz=UTC] → Snowflake: TIMESTAMP_TZ
   ```
   
   This preserves the explicit timezone information from the source data.
   
   ## Minimal Reproducible Example
   ```python
   import polars as pol
   from datetime import datetime, timezone
   import adbc_driver_snowflake.dbapi
   
   # Create DataFrame with explicit UTC timestamps
   df = pol.DataFrame({
       "id": [1, 2, 3],
       "timestamp": [
           datetime(2024, 12, 3, 10, 0, 0, tzinfo=timezone.utc),
           datetime(2024, 12, 3, 11, 0, 0, tzinfo=timezone.utc),
           datetime(2024, 12, 3, 12, 0, 0, tzinfo=timezone.utc),
       ]
   }).with_columns(
       pol.col("timestamp").dt.replace_time_zone("UTC").dt.cast_time_unit("us")
   )
   
   # Verify Arrow schema has timezone info
   print(df.to_arrow().schema)
   # timestamp: timestamp[us, tz=UTC]
   
   # Push to Snowflake using ADBC
   conn = adbc_driver_snowflake.dbapi.connect("<snowflake_uri>")
   df.write_database(
       table_name="test_table",
       connection=conn,
       engine="adbc",
       if_table_exists="replace"
   )
   conn.close()
   ```
   
   **Actual result in Snowflake:**
   ```sql
   SHOW COLUMNS IN test_table;
   -- timestamp column type: TIMESTAMP_LTZ(6)
   ```
   
   **Expected result:**
   ```sql
   -- timestamp column type: TIMESTAMP_TZ(6)
   ```
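
   The declared column type can also be checked programmatically over ADBC (a small sketch, assuming an open connection `conn` as in the example above; the reproduction closes its connection):
   ```python
   # Snowflake's DESCRIBE TABLE returns one row per column;
   # the first two fields are the column name and its declared type.
   cur = conn.cursor()
   cur.execute("DESCRIBE TABLE test_table")
   for name, col_type, *rest in cur.fetchall():
       print(name, col_type)  # e.g. TIMESTAMP TIMESTAMP_LTZ(6)
   cur.close()
   ```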
   
   ### Real-World Impact
   
   This causes critical issues in multi-source data pipelines:
   
   **Scenario:** Merging flight trajectory data from two sources (ACARS and 
SWIM)
   - **Source A:** Data manually created with `TIMESTAMP_TZ` schema → 
timestamps always displayed as UTC
   - **Source B:** Data ingested via ADBC → `TIMESTAMP_LTZ` created → 
timestamps displayed in session timezone
   
   **Result:** When users in non-UTC timezones query and merge these datasets, 
the timestamps appear misaligned:
   ```sql
   -- User's Snowflake session timezone: America/Los_Angeles (UTC-8)

   -- Query Source A (TIMESTAMP_TZ)
   SELECT timestamp FROM source_a WHERE id = 1;
   -- Returns: 2024-12-03 10:00:00 +0000 (explicit UTC)

   -- Query Source B (TIMESTAMP_LTZ)
   SELECT timestamp FROM source_b WHERE id = 1;
   -- Returns: 2024-12-03 02:00:00 -0800 (converted to session TZ)

   -- Merge the sources
   SELECT a.timestamp AS ts_a, b.timestamp AS ts_b
   FROM source_a a
   JOIN source_b b ON a.id = b.id;
   -- ts_a: 2024-12-03 10:00:00 +0000
   -- ts_b: 2024-12-03 02:00:00 -0800
   -- These appear 8 hours apart even though they represent the same instant!
   ```
   
   This breaks temporal joins, time-series analysis, and data quality 
validation.
   
   ## The Semantic Issue
   
   When Arrow explicitly provides timezone metadata, the driver should preserve 
that information:
   
   | Arrow Type | Current Mapping | Correct Mapping | Reasoning |
   |------------|-----------------|-----------------|-----------|
   | `timestamp[us]` (no tz) | `TIMESTAMP_LTZ` | `TIMESTAMP_NTZ` or `TIMESTAMP_LTZ` | Ambiguous; either makes sense |
   | `timestamp[us, tz=UTC]` | `TIMESTAMP_LTZ` ❌ | `TIMESTAMP_TZ` ✅ | Timezone is explicitly specified |
   
   The key difference:
   - **`TIMESTAMP_LTZ`**: "This is a UTC instant, interpret it in the user's 
local timezone"
   - **`TIMESTAMP_TZ`**: "This is a UTC instant **and** the timezone (UTC) is 
part of the data"
   
   When Arrow says `tz=UTC`, that timezone should be stored in Snowflake.
   
   ## Current Workaround
   
   Pre-create tables with explicit schema before ingestion:
   ```python
   cursor = conn.cursor()
   cursor.execute("""
       CREATE TABLE test_table (
           id NUMBER(38, 0),
           timestamp TIMESTAMP_TZ
       )
   """)
   cursor.close()
   
   # Then append data
   df.write_database(
       table_name="test_table",
       connection=conn,
       engine="adbc",
       if_table_exists="append"
   )
   ```
   
   This works but requires boilerplate schema management and defeats the 
purpose of schema inference from Arrow types.
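
   The boilerplate could at least be generated from the Arrow schema. Below is a hypothetical helper (the name `create_table_ddl` and the minimal type mapping are my own, not driver API) that applies the mapping this issue proposes:
   ```python
   import pyarrow as pa

   def create_table_ddl(table_name: str, schema: pa.Schema) -> str:
       """Hypothetical helper: derive CREATE TABLE DDL from an Arrow schema,
       mapping timezone-aware timestamps to TIMESTAMP_TZ. Only the types
       used in the example above are covered."""
       columns = []
       for field in schema:
           if pa.types.is_timestamp(field.type):
               # The proposed rule: explicit tz metadata -> TIMESTAMP_TZ.
               sf_type = "TIMESTAMP_TZ" if field.type.tz is not None else "TIMESTAMP_NTZ"
           elif pa.types.is_integer(field.type):
               sf_type = "NUMBER(38, 0)"
           else:
               raise NotImplementedError(f"no mapping for {field.type}")
           columns.append(f"{field.name} {sf_type}")
       return f'CREATE TABLE {table_name} ({", ".join(columns)})'

   # cursor.execute(create_table_ddl("test_table", df.to_arrow().schema))
   ```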
   
   ## Proposed Solution
   
   **Option 1 (Preferred):** Change default mapping for timezone-aware 
timestamps
   ```
   timestamp[us, tz=<any>] → TIMESTAMP_TZ
   timestamp[us] (no tz)   → TIMESTAMP_LTZ (current behavior, unchanged)
   ```
   
   **Option 2:** Add a connection/statement option
   ```python
   # Allow users to specify timestamp type preference
   conn = adbc_driver_snowflake.dbapi.connect(
       uri="<snowflake_uri>",
       db_kwargs={"adbc.snowflake.ingest.timestamp_tz_for_aware": "true"}
   )
   ```
   
   **Option 3:** Follow Arrow metadata conventions
   If Arrow timestamp has timezone metadata, use `TIMESTAMP_TZ`. This aligns 
with Arrow's philosophy of preserving metadata.
   
   ## Environment
   
   - **ADBC Driver:** `adbc-driver-snowflake` (latest from PyPI)
   - **Arrow:** `pyarrow` (via Polars)
   - **Language:** Python 3.11+
   - **Framework:** Polars DataFrames
   - **Snowflake:** Standard configuration
   
   ### Additional Context
   
   The [Snowflake ADBC driver 
documentation](https://arrow.apache.org/adbc/current/driver/snowflake.html#type-support)
 states:
   
   > **timestamp_ltz, timestamp_ntz, timestamp_tz**: timestamp[ns]
   > Local time zone will be used, except for timestamp_ntz which is not an 
instant.
   
   This describes the **read path** (Snowflake → Arrow) but doesn't specify the 
**write path** (Arrow → Snowflake) behavior for timezone-aware timestamps. The 
write path mapping should be symmetric and preserve timezone information when 
it exists in the source.
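
   A round-trip makes the asymmetry concrete (a sketch, assuming an open connection `conn` and the DataFrame `df` from the reproduction above):
   ```python
   # Compare the source Arrow schema with what the driver hands back on read.
   cur = conn.cursor()
   cur.execute("SELECT * FROM test_table")
   readback = cur.fetch_arrow_table()
   cur.close()

   print(df.to_arrow().schema)  # timestamp: timestamp[us, tz=UTC]
   print(readback.schema)       # read-path type per the docs above; the tz
                                # metadata is what the write path should preserve
   ```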
   
   ## Acknowledgment
   
   I understand #2843 concluded that explicit schema creation is the 
recommended approach. However, I believe timezone-aware Arrow data represents a 
distinct case where the "correct" mapping is unambiguous: explicit timezone in 
Arrow should map to explicit timezone in Snowflake (`TIMESTAMP_TZ`), not 
session-dependent interpretation (`TIMESTAMP_LTZ`).
   
   Even if the default behavior cannot change for backward compatibility, an 
opt-in option would greatly improve the developer experience for timezone-aware 
data pipelines.
   
   ### Stack Trace
   
   _No response_
   
   ### How can we reproduce the bug?
   
   _No response_
   
   ### Environment/Setup
   
   _No response_

