enrico-stauss opened a new issue, #3766:
URL: https://github.com/apache/arrow-adbc/issues/3766
### What happened?
# Bug Report
## Summary
When ingesting timezone-aware Arrow timestamps (e.g., `timestamp[us,
tz=UTC]`) via the Snowflake ADBC driver, tables are automatically created with
`TIMESTAMP_LTZ` columns instead of `TIMESTAMP_TZ` columns. This loses the
explicit timezone information present in the Arrow data and causes temporal
misalignment issues in multi-source pipelines.
## Related Issues
This is related to #2843, but specifically addresses **timezone-aware**
timestamps where Arrow explicitly provides timezone metadata.
## What happened?
The Snowflake ADBC driver maps timezone-aware Arrow timestamps to
`TIMESTAMP_LTZ` during table creation:
```
Arrow: timestamp[us, tz=UTC] → Snowflake: TIMESTAMP_LTZ
```
This is semantically incorrect because:
- **`TIMESTAMP_LTZ`** stores UTC values but interprets them relative to the
**session timezone**
- **`TIMESTAMP_TZ`** stores UTC values **with explicit timezone metadata
preserved**
When Arrow data explicitly specifies a timezone, that information should be
preserved in Snowflake's `TIMESTAMP_TZ` type.
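For context, the timezone is metadata on the Arrow type itself, not a per-value annotation, which is easy to confirm with PyArrow:

```python
import pyarrow as pa

# The timezone is part of the Arrow type definition
t = pa.timestamp("us", tz="UTC")
print(t)     # timestamp[us, tz=UTC]
print(t.tz)  # UTC
```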
## What did you expect to happen?
Timezone-aware Arrow timestamps should map to `TIMESTAMP_TZ`:
```
Arrow: timestamp[us, tz=UTC] → Snowflake: TIMESTAMP_TZ
```
This preserves the explicit timezone information from the source data.
## Minimal Reproducible Example
```python
import polars as pol
from datetime import datetime, timezone
import adbc_driver_snowflake.dbapi

# Create DataFrame with explicit UTC timestamps
df = pol.DataFrame({
    "id": [1, 2, 3],
    "timestamp": [
        datetime(2024, 12, 3, 10, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 12, 3, 11, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 12, 3, 12, 0, 0, tzinfo=timezone.utc),
    ],
}).with_columns(
    pol.col("timestamp").dt.replace_time_zone("UTC").dt.cast_time_unit("us")
)

# Verify that the Arrow schema carries timezone info
print(df.to_arrow().schema)
# timestamp: timestamp[us, tz=UTC]

# Push to Snowflake using ADBC
conn = adbc_driver_snowflake.dbapi.connect("<snowflake_uri>")
df.write_database(
    table_name="test_table",
    connection=conn,
    engine="adbc",
    if_table_exists="replace",
)
conn.close()
```
**Actual result in Snowflake:**
```sql
SHOW COLUMNS IN test_table;
-- timestamp column type: TIMESTAMP_LTZ(6)
```
**Expected result:**
```sql
-- timestamp column type: TIMESTAMP_TZ(6)
```
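The same column-type check can be scripted through the ADBC cursor before calling `conn.close()`. A sketch (Snowflake stores unquoted identifiers in uppercase, hence `'TEST_TABLE'`):

```python
cur = conn.cursor()
cur.execute(
    "SELECT column_name, data_type "
    "FROM information_schema.columns "
    "WHERE table_name = 'TEST_TABLE'"
)
print(cur.fetchall())
# With the current mapping, the timestamp column reports TIMESTAMP_LTZ
```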
### Real-World Impact
This causes critical issues in multi-source data pipelines:
**Scenario:** Merging flight trajectory data from two sources (ACARS and
SWIM)
- **Source A:** Data manually created with `TIMESTAMP_TZ` schema →
timestamps always displayed as UTC
- **Source B:** Data ingested via ADBC → `TIMESTAMP_LTZ` created →
timestamps displayed in session timezone
**Result:** When users in non-UTC timezones query and merge these datasets,
the timestamps appear misaligned:
```sql
-- User's Snowflake session timezone: America/Los_Angeles (UTC-8)

-- Query Source A (TIMESTAMP_TZ)
SELECT timestamp FROM source_a WHERE id = 1;
-- Returns: 2024-12-03 10:00:00 +0000 (explicit UTC)

-- Query Source B (TIMESTAMP_LTZ)
SELECT timestamp FROM source_b WHERE id = 1;
-- Returns: 2024-12-03 02:00:00 -0800 (converted to session TZ)

-- Merge the sources
SELECT a.timestamp AS ts_a, b.timestamp AS ts_b
FROM source_a a
JOIN source_b b ON a.id = b.id;
-- ts_a: 2024-12-03 10:00:00 +0000
-- ts_b: 2024-12-03 02:00:00 -0800
-- These appear 8 hours apart even though they represent the same instant!
```
This breaks temporal joins, time-series analysis, and data quality
validation.
## The Semantic Issue
When Arrow explicitly provides timezone metadata, the driver should preserve
that information:
| Arrow Type | Current Mapping | Correct Mapping | Reasoning |
|------------|-----------------|-----------------|-----------|
| `timestamp[us]` (no tz) | `TIMESTAMP_LTZ` | `TIMESTAMP_NTZ` or `TIMESTAMP_LTZ` | Ambiguous; either makes sense |
| `timestamp[us, tz=UTC]` | `TIMESTAMP_LTZ` ❌ | `TIMESTAMP_TZ` ✅ | Timezone is explicitly specified |
The key difference:
- **`TIMESTAMP_LTZ`**: "This is a UTC instant, interpret it in the user's
local timezone"
- **`TIMESTAMP_TZ`**: "This is a UTC instant **and** the timezone (UTC) is
part of the data"
When Arrow says `tz=UTC`, that timezone should be stored in Snowflake.
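To make the difference concrete, here is a sketch of how one could observe it through the ADBC cursor, assuming the two tables from the scenario above exist (`ALTER SESSION SET TIMEZONE` and `TO_VARCHAR` are standard Snowflake):

```python
cur = conn.cursor()  # an open ADBC connection, as in the example above

# Pin a non-UTC session timezone to expose the difference
cur.execute("ALTER SESSION SET TIMEZONE = 'America/Los_Angeles'")

# TO_VARCHAR shows Snowflake's own rendering, bypassing driver-side conversion
cur.execute("SELECT TO_VARCHAR(timestamp) FROM source_b WHERE id = 1")
print(cur.fetchone())  # TIMESTAMP_LTZ: wall clock shifts with the session timezone

cur.execute("SELECT TO_VARCHAR(timestamp) FROM source_a WHERE id = 1")
print(cur.fetchone())  # TIMESTAMP_TZ: the stored +0000 offset travels with the value
```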
## Current Workaround
Pre-create tables with explicit schema before ingestion:
```python
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE test_table (
        id NUMBER(38, 0),
        timestamp TIMESTAMP_TZ
    )
""")
cursor.close()

# Then append data
df.write_database(
    table_name="test_table",
    connection=conn,
    engine="adbc",
    if_table_exists="append",
)
```
This works but requires boilerplate schema management and defeats the
purpose of schema inference from Arrow types.
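One way to cut that boilerplate while the mapping is unchanged is to derive the DDL from the Arrow schema. A hypothetical helper (names chosen for illustration, minimal type coverage):

```python
import pyarrow as pa

# Map a few common Arrow types to Snowflake types; extend as needed
_SIMPLE_TYPES = {
    pa.int64(): "NUMBER(38, 0)",
    pa.float64(): "DOUBLE",
    pa.string(): "VARCHAR",
}

def create_table_ddl(name: str, schema: pa.Schema) -> str:
    """Build CREATE TABLE DDL, mapping tz-aware timestamps to TIMESTAMP_TZ."""
    cols = []
    for field in schema:
        if pa.types.is_timestamp(field.type):
            sql_type = "TIMESTAMP_TZ" if field.type.tz else "TIMESTAMP_NTZ"
        else:
            sql_type = _SIMPLE_TYPES.get(field.type, "VARCHAR")
        cols.append(f"{field.name} {sql_type}")
    return f"CREATE TABLE {name} ({', '.join(cols)})"

# Usage: cursor.execute(create_table_ddl("test_table", df.to_arrow().schema))
```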
## Proposed Solution
**Option 1 (Preferred):** Change default mapping for timezone-aware
timestamps
```
timestamp[us, tz=<any>] → TIMESTAMP_TZ
timestamp[us] (no tz) → TIMESTAMP_LTZ (current behavior, unchanged)
```
**Option 2:** Add a connection/statement option
```python
# Allow users to specify timestamp type preference
conn = adbc_driver_snowflake.dbapi.connect(
    uri="<snowflake_uri>",
    db_kwargs={"adbc.snowflake.ingest.timestamp_tz_for_aware": "true"},
)
```
**Option 3:** Follow Arrow metadata conventions
If Arrow timestamp has timezone metadata, use `TIMESTAMP_TZ`. This aligns
with Arrow's philosophy of preserving metadata.
## Environment
- **ADBC Driver:** `adbc-driver-snowflake` (latest from PyPI)
- **Arrow:** `pyarrow` (via Polars)
- **Language:** Python 3.11+
- **Framework:** Polars DataFrames
- **Snowflake:** Standard configuration
### Additional Context
The [Snowflake ADBC driver
documentation](https://arrow.apache.org/adbc/current/driver/snowflake.html#type-support)
states:
> **timestamp_ltz, timestamp_ntz, timestamp_tz**: timestamp[ns]
> Local time zone will be used, except for timestamp_ntz which is not an instant.
This describes the **read path** (Snowflake → Arrow) but doesn't specify the
**write path** (Arrow → Snowflake) behavior for timezone-aware timestamps. The
write path mapping should be symmetric and preserve timezone information when
it exists in the source.
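A quick round-trip check makes the asymmetry easy to inspect (a sketch; `fetch_arrow_table()` is provided by the ADBC DB-API cursor):

```python
# After ingesting df, read the table back and compare Arrow schemas
cur = conn.cursor()
cur.execute("SELECT * FROM test_table")
print(cur.fetch_arrow_table().schema)  # compare against df.to_arrow().schema
```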
## Acknowledgment
I understand #2843 concluded that explicit schema creation is the
recommended approach. However, I believe timezone-aware Arrow data represents a
distinct case where the "correct" mapping is unambiguous: explicit timezone in
Arrow should map to explicit timezone in Snowflake (`TIMESTAMP_TZ`), not
session-dependent interpretation (`TIMESTAMP_LTZ`).
Even if the default behavior cannot change for backward compatibility, an
opt-in option would greatly improve the developer experience for timezone-aware
data pipelines.
### Stack Trace
_No response_
### How can we reproduce the bug?
_No response_
### Environment/Setup
_No response_