For query 22, iceberg.mr.split.size affects the number of mappers. With the
default value of 128MB, Hive creates far fewer mappers than it does on ORC
tables.
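
To illustrate why the split size matters, here is a rough back-of-the-envelope
sketch. It assumes each data file is cut into ceil(file size / split size)
splits and that each split becomes one mapper. This is a simplification, not
what IcebergInputFormat.getSplits() actually does; the real planner may combine
or group splits differently. The function name and the file sizes are
hypothetical and are not taken from the benchmark.

```python
MB = 1024 * 1024

def estimate_splits(file_sizes_bytes, split_size_bytes=128 * MB):
    """Rough estimate: each file yields ceil(size / split_size) splits."""
    if split_size_bytes <= 0:
        raise ValueError("split size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return sum(
        (size + split_size_bytes - 1) // split_size_bytes
        for size in file_sizes_bytes
        if size > 0
    )

# Hypothetical input: three 512MB data files (made-up sizes).
files = [512 * MB] * 3
print(estimate_splits(files))           # 128MB splits -> 12
print(estimate_splits(files, 16 * MB))  # 16MB splits  -> 96
```

Under this simplified model, a smaller split size yields proportionally more
splits, and therefore more mappers, for the same input.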
For query 64, the wrong result is due to a bug in shared work optimization.
Setting hive.optimize.shared.work.extended to false produces correct results
for query 64.
Because of several bugs in shared work optimization (and parallel edge fixer),
it might make sense to set the default value of
hive.optimize.shared.work to false in HiveConf.java.
--- Sungwoo
On Fri, 18 Nov 2022, Sungwoo Park wrote:
Hello Stamatis,
We use a recent or the latest commit in the master branch and run Hive on Tez
0.10.2.
For query 22, the slow execution seems to be related to the split size used
in IcebergInputFormat.getSplits(). We will try to create a JIRA when we make
more progress.
For query 64, the result is wrong (returning 0 rows) on 1TB TPC-DS, but there
is a separate report that the result is correct on 100GB TPC-DS. Not sure why
this happens, so we are going to run more experiments.
Best,
Sungwoo
On Thu, 17 Nov 2022, Stamatis Zampetakis wrote:
Hi Sungwoo,
Many thanks for sharing your findings; interesting observations.
If you can, please also share the project versions that you used for running
the experiments.
Best,
Stamatis
On Tue, Nov 15, 2022 at 12:46 PM Sungwoo Park <c...@pl.postech.ac.kr> wrote:
Hello,
I ran the TPC-DS benchmark using Metastore (in the traditional way) and Iceberg,
and would like to share the results for those interested in Hive using Iceberg.
The experiment used the 1TB TPC-DS dataset stored as ORC.
Here are a few findings.
1. Overall, Hive-Iceberg runs slightly faster than Hive-Metastore.
2. Some queries run much faster with Hive-Iceberg. Examples:
query 14-1: Hive-Metastore: 61 seconds, Hive-Iceberg: 28 seconds
query 78: Hive-Metastore: 141 seconds, Hive-Iceberg: 58 seconds
3. Some queries run much slower with Hive-Iceberg. Example:
query 22: Hive-Metastore: 32 seconds, Hive-Iceberg: 356 seconds
(The slow execution is due to InputInitializer generating only 4 tasks for the
first Map vertex.)
4. Out of 99 queries, 98 queries return correct results, but query 64 returns
wrong results (returning 0 rows) due to an exception:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
hdfs://blue0:8020/tmp/hive/user/35d3bdd7-4fda-4f3d-818d-048ad6242072/hive_2022-11-14_15-26-21_045_8992557056967167667-1/-mr-10001/.hive-staging_hive_2022-11-14_15-26-21_045_8992557056967167667-1/-ext-10002
--- Sungwoo