[
https://issues.apache.org/jira/browse/IMPALA-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Smith updated IMPALA-13627:
-----------------------------------
Description:
HIVE-12192, HIVE-20007 changed the way that timestamp computations are
performed and to some extend how timestamps are serialized and deserialized in
files (Parquet, Avro) by Hive. HIVE-25104 was added to allow Hive to continue
to write files using the legacy timestamp conversion so that older Hive
versions can read the correct time.
All of that is background to say that Impala - when converting UTC timestamps
with {{convert_legacy_hive_parquet_utc_timestamps}} - does not mirror Hive's
timestamp conversion when reading INT96 coded timestamps that Hive converted
from local time to UTC using legacy timezone conversion. To reproduce
# Start Hive with TZ=Asia/Kuala_Lumpur
# Using beeline
{code}
create table test (d timestamp) stored as parquet;
set hive.parquet.timestamp.write.legacy.conversion.enabled=true;
insert into test values (cast("1900-01-01 00:00:00" as timestamp));
select * from test;
{code}
# Start Impala with TZ=Asia/Kuala_Lumpur
# Run {{impala-shell.sh -q 'select * from test'}}
In this particular example, Asia/Kuala_Lumpur will either map to (LMT) or (SGT)
depending on tzdata version. In either case, that time zone for 1900 differs
from the current UTC+8 timezone shift, so Impala shows a value that's off by ~1
hour.
The Parquet file Hive writes in this case contains
[key_value_metadata|https://parquet.apache.org/docs/file-format/metadata/] such
that Hive can identify what conversion to use when reading the data, so newer
Hive always handles these files correctly
{code}
writer.time.zone=Asia/Kuala_Lumpur
writer.model.name=3.1.3000.7.1.7.1000-141
writer.date.proleptic=false
writer.zone.conversion.legacy=true
{code}
Impala could support the same behavior to be compatible with Hive by
identifying the {{writer.zone.conversion.legacy}} flag and handing conversion
to a SimpleDateFormat in Java ([Hive
code|https://github.com/apache/hive/blob/rel/release-4.0.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L194-L201]).
was:
HIVE-12192, HIVE-20007 changed the way that timestamp computations are
performed and to some extend how timestamps are serialized and deserialized in
files (Parquet, Avro) by Hive. HIVE-25104 was added to allow Hive to continue
to write files using the legacy timestamp conversion so that older Hive
versions can read the correct time.
All of that is background to say that Impala does not mirror Hive's timestamp
conversion when reading INT96 coded timestamps that Hive converted from local
time to UTC using legacy timezone conversion. To reproduce
# Start Hive with TZ=Asia/Kuala_Lumpur
# Using beeline
{code}
create table test (d timestamp) stored as parquet;
set hive.parquet.timestamp.write.legacy.conversion.enabled=true;
insert into test values (cast("1900-01-01 00:00:00" as timestamp));
select * from test;
{code}
# Start Impala with TZ=Asia/Kuala_Lumpur
# Run {{impala-shell.sh -q 'select * from test'}}
In this particular example, Asia/Kuala_Lumpur will either map to (LMT) or (SGT)
depending on tzdata version. In either case, that time zone for 1900 differs
from the current UTC+8 timezone shift, so Impala shows a value that's off by ~1
hour.
The Parquet file Hive writes in this case contains
[key_value_metadata|https://parquet.apache.org/docs/file-format/metadata/] such
that Hive can identify what conversion to use when reading the data, so newer
Hive always handles these files correctly
{code}
writer.time.zone=Asia/Kuala_Lumpur
writer.model.name=3.1.3000.7.1.7.1000-141
writer.date.proleptic=false
writer.zone.conversion.legacy=true
{code}
Impala could support the same behavior to be compatible with Hive by
identifying the {{writer.zone.conversion.legacy}} flag and handing conversion
to a SimpleDateFormat in Java ([Hive
code|https://github.com/apache/hive/blob/rel/release-4.0.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L194-L201]).
> Impala uses different timezone conversion when reading Hive with legacy
> conversion
> ----------------------------------------------------------------------------------
>
> Key: IMPALA-13627
> URL: https://issues.apache.org/jira/browse/IMPALA-13627
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.4.1
> Reporter: Michael Smith
> Assignee: Michael Smith
> Priority: Critical
> Labels: avro, parquet
>
> HIVE-12192, HIVE-20007 changed the way that timestamp computations are
> performed and to some extend how timestamps are serialized and deserialized
> in files (Parquet, Avro) by Hive. HIVE-25104 was added to allow Hive to
> continue to write files using the legacy timestamp conversion so that older
> Hive versions can read the correct time.
> All of that is background to say that Impala - when converting UTC timestamps
> with {{convert_legacy_hive_parquet_utc_timestamps}} - does not mirror Hive's
> timestamp conversion when reading INT96 coded timestamps that Hive converted
> from local time to UTC using legacy timezone conversion. To reproduce
> # Start Hive with TZ=Asia/Kuala_Lumpur
> # Using beeline
> {code}
> create table test (d timestamp) stored as parquet;
> set hive.parquet.timestamp.write.legacy.conversion.enabled=true;
> insert into test values (cast("1900-01-01 00:00:00" as timestamp));
> select * from test;
> {code}
> # Start Impala with TZ=Asia/Kuala_Lumpur
> # Run {{impala-shell.sh -q 'select * from test'}}
> In this particular example, Asia/Kuala_Lumpur will either map to (LMT) or
> (SGT) depending on tzdata version. In either case, that time zone for 1900
> differs from the current UTC+8 timezone shift, so Impala shows a value that's
> off by ~1 hour.
> The Parquet file Hive writes in this case contains
> [key_value_metadata|https://parquet.apache.org/docs/file-format/metadata/]
> such that Hive can identify what conversion to use when reading the data, so
> newer Hive always handles these files correctly
> {code}
> writer.time.zone=Asia/Kuala_Lumpur
> writer.model.name=3.1.3000.7.1.7.1000-141
> writer.date.proleptic=false
> writer.zone.conversion.legacy=true
> {code}
> Impala could support the same behavior to be compatible with Hive by
> identifying the {{writer.zone.conversion.legacy}} flag and handing conversion
> to a SimpleDateFormat in Java ([Hive
> code|https://github.com/apache/hive/blob/rel/release-4.0.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L194-L201]).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]