Re: Result of the TPC-DS benchmark using Iceberg

Sungwoo Park Mon, 28 Nov 2022 01:13:11 -0800

For query 22, iceberg.mr.split.size affects the number of mappers. With the default value of 128MB, Hive creates far fewer mappers than it does on ORC tables.
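
As a rough illustration of why the split size matters, here is a back-of-the-envelope sketch in Python. It assumes roughly one split (and hence one map task) per split-size chunk of input, which is a simplification: the real count depends on file layout and on how Tez groups splits. The function name and the 10 GB input size are hypothetical and are not taken from the benchmark.

```python
MB = 1024 * 1024


def estimate_split_count(total_bytes: int, split_size_bytes: int) -> int:
    """Estimate the number of splits, assuming one split per
    split_size_bytes of input (a simplification of real split planning)."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if split_size_bytes <= 0:
        raise ValueError("split_size_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding issues.
    return -(-total_bytes // split_size_bytes)


if __name__ == "__main__":
    # Hypothetical 10 GB of input data, for illustration only.
    input_bytes = 10 * 1024 * MB
    for split_mb in (128, 64, 16):
        count = estimate_split_count(input_bytes, split_mb * MB)
        print(f"split size {split_mb}MB -> about {count} splits")
```

Under this assumption, halving the split size doubles the estimated number of splits, so a smaller iceberg.mr.split.size should give the first Map vertex more tasks.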

For query 64, the wrong result is due to a bug in shared work optimization. Setting hive.optimize.shared.work.extended to false produces correct results for query 64.

Because of several bugs in shared work optimization (and parallel edge fixer), it might make sense to set the default value of hive.optimize.shared.work to false in HiveConf.java.


--- Sungwoo

On Fri, 18 Nov 2022, Sungwoo Park wrote:

Hello Stamatis,

We use a recent or the latest commit in the master branch and run Hive on Tez 0.10.2.

For query 22, the slow execution seems to be related to the split size used in IcebergInputFormat.getSplits(). We will try to create a JIRA when we make more progress.

For query 64, the result is wrong (returning 0 rows) on 1TB TPC-DS, but there is a separate report that the result is correct on 100GB TPC-DS. Not sure why this happens, so we are going to run more experiments.


Best,

Sungwoo

On Thu, 17 Nov 2022, Stamatis Zampetakis wrote:

Hi Sungwoo,

Many thanks for sharing your findings; interesting observations.

If you can, please also share the project versions that you used for running
the experiments.

Best,
Stamatis

On Tue, Nov 15, 2022 at 12:46 PM Sungwoo Park <c...@pl.postech.ac.kr> wrote:

Hello,

I ran the TPC-DS benchmark using Metastore (in the traditional way) and
Iceberg, and would like to share the result for those interested in Hive
using Iceberg.
The experiment used 1TB TPC-DS dataset stored as ORC.

Here are a few findings.

1. Overall, Hive-Iceberg runs slightly faster than Hive-Metastore.

2. Some queries run much faster with Hive-Iceberg. Examples:
query 14-1: Hive-Metastore: 61 seconds, Hive-Iceberg: 28 seconds
query 78: Hive-Metastore: 141 seconds, Hive-Iceberg: 58 seconds

3. Some queries run much slower with Hive-Iceberg. Example:
query 22: Hive-Metastore: 32 seconds, Hive-Iceberg: 356 seconds
(The slow execution is due to InputInitializer generating only 4 tasks for
the first Map vertex.)

4. Out of 99 queries, 98 queries return correct results, but query 64
returns wrong results (returning 0 rows) due to an exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

hdfs://blue0:8020/tmp/hive/user/35d3bdd7-4fda-4f3d-818d-048ad6242072/hive_2022-11-14_15-26-21_045_8992557056967167667-1/-mr-10001/.hive-staging_hive_2022-11-14_15-26-21_045_8992557056967167667-1/-ext-10002

--- Sungwoo
