[ 
https://issues.apache.org/jira/browse/IMPALA-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Smith updated IMPALA-13627:
-----------------------------------
    Description: 
HIVE-12192, HIVE-20007 changed the way that timestamp computations are 
performed and to some extend how timestamps are serialized and deserialized in 
files (Parquet, Avro) by Hive. HIVE-25104 was added to allow Hive to continue 
to write files using the legacy timestamp conversion so that older Hive 
versions can read the correct time.

All of that is background to say that Impala - when converting UTC timestamps 
with {{convert_legacy_hive_parquet_utc_timestamps}} - does not mirror Hive's 
timestamp conversion when reading INT96 coded timestamps that Hive converted 
from local time to UTC using legacy timezone conversion. To reproduce
# Start Hive with TZ=Asia/Kuala_Lumpur
# Using beeline
{code}
create table test (d timestamp) stored as parquet;
set hive.parquet.timestamp.write.legacy.conversion.enabled=true;
insert into test values (cast("1900-01-01 00:00:00" as timestamp));
select * from test;
{code}
# Run {{impala-shell.sh -Q timezone=Asia/Kuala_Lumpur -Q 
convert_legacy_hive_parquet_utc_timestamps=true -q 'select * from test'}}

In this particular example, Asia/Kuala_Lumpur will either map to (LMT) or (SGT) 
depending on tzdata version. In either case, that time zone for 1900 differs 
from the current UTC+8 timezone shift, so Impala shows a value that's off by ~1 
hour.

The Parquet file Hive writes in this case contains 
[key_value_metadata|https://parquet.apache.org/docs/file-format/metadata/] such 
that Hive can identify what conversion to use when reading the data, so newer 
Hive always handles these files correctly
{code}
writer.time.zone=Asia/Kuala_Lumpur
writer.model.name=3.1.3000.7.1.7.1000-141
writer.date.proleptic=false
writer.zone.conversion.legacy=true
{code}

Impala could support the same behavior to be compatible with Hive by 
identifying the {{writer.zone.conversion.legacy}} flag and handing conversion 
to a SimpleDateFormat in Java ([Hive 
code|https://github.com/apache/hive/blob/rel/release-4.0.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L194-L201]).

Impala also behaves differently with older files where 
{{writer.zone.conversion.legacy}} is not set. Hive's behavior is controlled by 
{{hive.parquet.timestamp.legacy.conversion.enabled}}, but Impala doesn't have a 
mode that uses SimpleDateFormat conversion. Impala's concept of "legacy" 
conversion is whether to assume UTC and convert to local timezone at all. It 
always uses tzdata for that conversion.

  was:
HIVE-12192, HIVE-20007 changed the way that timestamp computations are 
performed and to some extend how timestamps are serialized and deserialized in 
files (Parquet, Avro) by Hive. HIVE-25104 was added to allow Hive to continue 
to write files using the legacy timestamp conversion so that older Hive 
versions can read the correct time.

All of that is background to say that Impala - when converting UTC timestamps 
with {{convert_legacy_hive_parquet_utc_timestamps}} - does not mirror Hive's 
timestamp conversion when reading INT96 coded timestamps that Hive converted 
from local time to UTC using legacy timezone conversion. To reproduce
# Start Hive with TZ=Asia/Kuala_Lumpur
# Using beeline
{code}
create table test (d timestamp) stored as parquet;
set hive.parquet.timestamp.write.legacy.conversion.enabled=true;
insert into test values (cast("1900-01-01 00:00:00" as timestamp));
select * from test;
{code}
# Run {{impala-shell.sh -Q timezone=Asia/Kuala_Lumpur -Q 
convert_legacy_hive_parquet_utc_timestamps=true -q 'select * from test'}}

In this particular example, Asia/Kuala_Lumpur will either map to (LMT) or (SGT) 
depending on tzdata version. In either case, that time zone for 1900 differs 
from the current UTC+8 timezone shift, so Impala shows a value that's off by ~1 
hour.

The Parquet file Hive writes in this case contains 
[key_value_metadata|https://parquet.apache.org/docs/file-format/metadata/] such 
that Hive can identify what conversion to use when reading the data, so newer 
Hive always handles these files correctly
{code}
writer.time.zone=Asia/Kuala_Lumpur
writer.model.name=3.1.3000.7.1.7.1000-141
writer.date.proleptic=false
writer.zone.conversion.legacy=true
{code}

Impala could support the same behavior to be compatible with Hive by 
identifying the {{writer.zone.conversion.legacy}} flag and handing conversion 
to a SimpleDateFormat in Java ([Hive 
code|https://github.com/apache/hive/blob/rel/release-4.0.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L194-L201]).


> Impala uses different timezone conversion when reading Hive with legacy 
> conversion
> ----------------------------------------------------------------------------------
>
>                 Key: IMPALA-13627
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13627
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.4.1
>            Reporter: Michael Smith
>            Assignee: Michael Smith
>            Priority: Critical
>              Labels: avro, parquet
>
> HIVE-12192, HIVE-20007 changed the way that timestamp computations are 
> performed and to some extend how timestamps are serialized and deserialized 
> in files (Parquet, Avro) by Hive. HIVE-25104 was added to allow Hive to 
> continue to write files using the legacy timestamp conversion so that older 
> Hive versions can read the correct time.
> All of that is background to say that Impala - when converting UTC timestamps 
> with {{convert_legacy_hive_parquet_utc_timestamps}} - does not mirror Hive's 
> timestamp conversion when reading INT96 coded timestamps that Hive converted 
> from local time to UTC using legacy timezone conversion. To reproduce
> # Start Hive with TZ=Asia/Kuala_Lumpur
> # Using beeline
> {code}
> create table test (d timestamp) stored as parquet;
> set hive.parquet.timestamp.write.legacy.conversion.enabled=true;
> insert into test values (cast("1900-01-01 00:00:00" as timestamp));
> select * from test;
> {code}
> # Run {{impala-shell.sh -Q timezone=Asia/Kuala_Lumpur -Q 
> convert_legacy_hive_parquet_utc_timestamps=true -q 'select * from test'}}
> In this particular example, Asia/Kuala_Lumpur will either map to (LMT) or 
> (SGT) depending on tzdata version. In either case, that time zone for 1900 
> differs from the current UTC+8 timezone shift, so Impala shows a value that's 
> off by ~1 hour.
> The Parquet file Hive writes in this case contains 
> [key_value_metadata|https://parquet.apache.org/docs/file-format/metadata/] 
> such that Hive can identify what conversion to use when reading the data, so 
> newer Hive always handles these files correctly
> {code}
> writer.time.zone=Asia/Kuala_Lumpur
> writer.model.name=3.1.3000.7.1.7.1000-141
> writer.date.proleptic=false
> writer.zone.conversion.legacy=true
> {code}
> Impala could support the same behavior to be compatible with Hive by 
> identifying the {{writer.zone.conversion.legacy}} flag and handing conversion 
> to a SimpleDateFormat in Java ([Hive 
> code|https://github.com/apache/hive/blob/rel/release-4.0.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L194-L201]).
> Impala also behaves differently with older files where 
> {{writer.zone.conversion.legacy}} is not set. Hive's behavior is controlled 
> by {{hive.parquet.timestamp.legacy.conversion.enabled}}, but Impala doesn't 
> have a mode that uses SimpleDateFormat conversion. Impala's concept of 
> "legacy" conversion is whether to assume UTC and convert to local timezone at 
> all. It always uses tzdata for that conversion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to