[ 
https://issues.apache.org/jira/browse/IMPALA-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928158#comment-17928158
 ] 

ASF subversion and git services commented on IMPALA-13627:
----------------------------------------------------------

Commit 1b6395b8db09d271bd166bf501bdf7038d8be644 in impala's branch 
refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1b6395b8d ]

IMPALA-13627: Handle legacy Hive timezone conversion

After HIVE-12191, Hive has 2 different methods of calculating timestamp
conversion from UTC to local timezone. When Impala has
convert_legacy_hive_parquet_utc_timestamps=true, it assumes times
written by Hive are in UTC and converts them to local time using tzdata,
which matches the newer method introduced by HIVE-12191.

Some dates convert differently between the two methods, such as
Asia/Kuala_Lumpur or Singapore prior to 1982 (also seen in HIVE-24074).
After HIVE-25104, Hive writes 'writer.zone.conversion.legacy' to
distinguish which method is being used. As a result there are three
different cases we have to handle:
1. Hive prior to 3.1 used what’s now called “legacy conversion” using
   SimpleDateFormat.
2. Hive 3.1.2 (with HIVE-21290) used a new Java API that’s based on
   tzdata and added metadata to identify the timezone.
3. Hive 4 supports both, and added new file metadata to identify which
   one was used.

Adds handling for Hive files (identified by created_by=parquet-mr) where
we can infer the correct handling from Parquet file metadata:
1. if writer.zone.conversion.legacy is present (Hive 4), use it to
   determine whether to use a legacy conversion method compatible with
   Hive's legacy behavior, or convert using tzdata.
2. if writer.zone.conversion.legacy is not present but writer.time.zone
   is, we can infer it was written by Hive 3.1.2+ using new APIs.
3. otherwise it was likely written by an earlier Hive version.
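
The three cases above can be sketched as a small decision function. This
is an illustrative Java sketch only, not Impala's actual C++ code: the
class, enum, and method names are hypothetical, and the caller is assumed
to have already established that the file is a Hive file
(created_by=parquet-mr). Only the two metadata keys come from the
description above.

```java
import java.util.Map;

public class HiveTimestampConversionChooser {
    /** Which conversion a reader would apply (hypothetical names). */
    public enum Conversion { LEGACY, TZDATA }

    // Key names as written by Hive into Parquet key_value_metadata.
    public static final String LEGACY_KEY = "writer.zone.conversion.legacy";
    public static final String ZONE_KEY = "writer.time.zone";

    /**
     * @param metadata            Parquet file key/value metadata
     * @param useLegacyForOldHive stand-in for the
     *                            use_legacy_hive_timestamp_conversion option
     */
    public static Conversion choose(Map<String, String> metadata,
                                    boolean useLegacyForOldHive) {
        // Case 1: Hive 4 records explicitly which method the writer used.
        String legacy = metadata.get(LEGACY_KEY);
        if (legacy != null) {
            return Boolean.parseBoolean(legacy) ? Conversion.LEGACY : Conversion.TZDATA;
        }
        // Case 2: zone recorded but no legacy flag, so assume Hive 3.1.2+ new APIs.
        if (metadata.containsKey(ZONE_KEY)) {
            return Conversion.TZDATA;
        }
        // Case 3: likely an earlier Hive version; defer to the option.
        return useLegacyForOldHive ? Conversion.LEGACY : Conversion.TZDATA;
    }

    public static void main(String[] args) {
        System.out.println(choose(Map.of(LEGACY_KEY, "true"), false));
        System.out.println(choose(Map.of(ZONE_KEY, "Asia/Kuala_Lumpur"), true));
        System.out.println(choose(Map.of(), true));
    }
}
```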

Adds a new CLI and query option - use_legacy_hive_timestamp_conversion -
to select what conversion method to use in the 3rd case above, when
Impala determines that the file was written by Hive older than 3.1.2.
Defaults to false to minimize changes in Impala's behavior and because
going through JNI is ~50x slower even when the results would not differ;
Hive defaults to true for its equivalent setting:
hive.parquet.timestamp.legacy.conversion.enabled.

Hive legacy-compatible conversion uses a Java method that would be
complicated to mimic in C++, doing

  DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  formatter.setTimeZone(TimeZone.getTimeZone(timezone_string));
  java.util.Date date = formatter.parse(date_time_string);
  formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
  return formatter.format(date);
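
For experimenting outside Impala, the snippet above can be wrapped in a
self-contained class and compared with a java.time conversion. This is a
rough sketch under stated assumptions: the class and method names are
hypothetical, Locale.ROOT is added only to make the output deterministic,
and the java.time path is a stand-in for a tzdata-based conversion rather
than the code Hive or Impala actually runs. Results for old dates depend
on the JDK's tzdata, so no expected values are shown for 1900.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class LegacyHiveConversionSketch {
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    /** Local time to UTC via SimpleDateFormat, as in the snippet above. */
    public static String localToUtcLegacy(String dateTime, String zone)
            throws ParseException {
        DateFormat formatter = new SimpleDateFormat(PATTERN, Locale.ROOT);
        formatter.setTimeZone(TimeZone.getTimeZone(zone));
        Date date = formatter.parse(dateTime);
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(date);
    }

    /** Local time to UTC via java.time zone rules (assumed stand-in). */
    public static String localToUtcJavaTime(String dateTime, String zone) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern(PATTERN, Locale.ROOT);
        return LocalDateTime.parse(dateTime, f)
                .atZone(ZoneId.of(zone))
                .withZoneSameInstant(ZoneOffset.UTC)
                .format(f);
    }

    public static void main(String[] args) throws ParseException {
        String zone = "Asia/Kuala_Lumpur";
        for (String ts : new String[] {"1900-01-01 00:00:00", "2020-01-01 00:00:00"}) {
            System.out.println(ts + " legacy=" + localToUtcLegacy(ts, zone)
                    + " java.time=" + localToUtcJavaTime(ts, zone));
        }
    }
}
```

For recent dates both methods should agree; any difference is expected
only for older dates such as the pre-1982 cases mentioned earlier.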

IMPALA-9385 added a check against a Timezone pointer in
FromUnixTimestamp. That dominates the time in FromUnixTimeNanos,
overriding any benchmark gains from IMPALA-7417. Moves FromUnixTime to
allow inlining, and switches to using UTCPTR in the benchmark - as
IMPALA-9385 did in most other code - to restore benchmark results.

Testing:
- Adds JVM conversion method to convert-timestamp-benchmark.
- Adds tests for several cases from Hive conversion tests.

Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
Reviewed-on: http://gerrit.cloudera.org:8080/22293
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Csaba Ringhofer <[email protected]>


> Impala uses different timezone conversion when reading Hive with legacy 
> conversion
> ----------------------------------------------------------------------------------
>
>                 Key: IMPALA-13627
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13627
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.0.0, Impala 4.4.1
>            Reporter: Michael Smith
>            Assignee: Michael Smith
>            Priority: Critical
>              Labels: avro, parquet
>             Fix For: Impala 4.6.0
>
>
> HIVE-12192, HIVE-20007 changed the way that timestamp computations are 
> performed and to some extent how timestamps are serialized and deserialized 
> in files (Parquet, Avro) by Hive. HIVE-25104 was added to allow Hive to 
> continue to write files using the legacy timestamp conversion so that older 
> Hive versions can read the correct time.
> All of that is background to say that Impala - when converting UTC timestamps 
> with {{convert_legacy_hive_parquet_utc_timestamps}} - does not mirror Hive's 
> timestamp conversion when reading INT96 coded timestamps that Hive converted 
> from local time to UTC using legacy timezone conversion. To reproduce:
> # Start Hive with TZ=Asia/Kuala_Lumpur
> # Using beeline
> {code}
> create table test (d timestamp) stored as parquet;
> set hive.parquet.timestamp.write.legacy.conversion.enabled=true;
> insert into test values (cast("1900-01-01 00:00:00" as timestamp));
> select * from test;
> {code}
> # Run {{impala-shell.sh -Q timezone=Asia/Kuala_Lumpur -Q 
> convert_legacy_hive_parquet_utc_timestamps=true -q 'select * from test'}}
> In this particular example, Asia/Kuala_Lumpur will either map to (LMT) or 
> (SGT) depending on tzdata version. In either case, that time zone for 1900 
> differs from the current UTC+8 timezone shift, so Impala shows a value that's 
> off by ~1 hour.
> The Parquet file Hive writes in this case contains 
> [key_value_metadata|https://parquet.apache.org/docs/file-format/metadata/] 
> such that Hive can identify what conversion to use when reading the data, so 
> newer Hive always handles these files correctly
> {code}
> writer.time.zone=Asia/Kuala_Lumpur
> writer.model.name=3.1.3000.7.1.7.1000-141
> writer.date.proleptic=false
> writer.zone.conversion.legacy=true
> {code}
> Impala could support the same behavior to be compatible with Hive by 
> identifying the {{writer.zone.conversion.legacy}} flag and handing conversion 
> to a SimpleDateFormat in Java ([Hive 
> code|https://github.com/apache/hive/blob/rel/release-4.0.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L194-L201]).
> Impala also behaves differently with older files where 
> {{writer.zone.conversion.legacy}} is not set. Hive's behavior is controlled 
> by {{hive.parquet.timestamp.legacy.conversion.enabled}}, but Impala doesn't 
> have a mode that uses SimpleDateFormat conversion. Impala's concept of 
> "legacy" conversion is whether to assume UTC and convert to local timezone at 
> all. It always uses tzdata for that conversion.
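
The offset gap behind the reproduction in the quoted description can be
inspected with java.time. This is a minimal sketch with hypothetical
class and method names; it only looks up zone offsets from the JDK's
bundled tzdata and does not reproduce Hive's or Impala's conversion code.
The 1900 offset printed depends on the tzdata version, so no expected
value is given for it.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class KualaLumpurOffsetSketch {
    /** Offset from UTC that the zone's rules give for a local date-time. */
    public static ZoneOffset offsetAt(String zone, LocalDateTime local) {
        return ZoneId.of(zone).getRules().getOffset(local);
    }

    public static void main(String[] args) {
        String zone = "Asia/Kuala_Lumpur";
        // Historical offset for the timestamp used in the reproduction.
        System.out.println("1900: " + offsetAt(zone, LocalDateTime.of(1900, 1, 1, 0, 0)));
        // Current offset, UTC+8.
        System.out.println("2020: " + offsetAt(zone, LocalDateTime.of(2020, 1, 1, 0, 0)));
    }
}
```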



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]