[GitHub] [iceberg] parasj opened a new issue, #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

GitBox Mon, 19 Dec 2022 10:50:50 -0800


parasj opened a new issue, #6456:
URL: https://github.com/apache/iceberg/issues/6456


   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We are seeing substantially slower performance with Iceberg 1.1.0 MoR when 
compared to Iceberg 0.14.0 MoR over the TPC-DS refresh benchmark. Ideally, we 
expect MERGE latency to be lower for MoR versus CoW tables.
   
   A typical TPC-DS refresh merge takes:
   * Iceberg 0.14.0 MoR: all merges take an average of 128s
   * Iceberg 1.1.0 MoR: merges 1-9 take an average of 564s, merge 10 takes 
12,151s
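
To put the numbers above in relative terms, here is a minimal sketch that computes the slowdown factors. The latencies are the averages reported in this issue; the variable and function names are only illustrative.

```python
# Average MERGE latencies in seconds, as reported in this issue.
BASELINE_0_14_0 = 128          # Iceberg 0.14.0 MoR, all merges
MERGES_1_TO_9_1_1_0 = 564      # Iceberg 1.1.0 MoR, merges 1-9
MERGE_10_1_1_0 = 12_151        # Iceberg 1.1.0 MoR, merge 10


def slowdown(new_seconds: float, old_seconds: float) -> float:
    """Return how many times slower the new latency is than the old one."""
    return new_seconds / old_seconds


print(f"merges 1-9: {slowdown(MERGES_1_TO_9_1_1_0, BASELINE_0_14_0):.1f}x slower")
print(f"merge 10:   {slowdown(MERGE_10_1_1_0, BASELINE_0_14_0):.1f}x slower")
```

This works out to roughly 4.4x for merges 1-9 and roughly 94.9x for merge 10.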
   
   It seems like we are encountering [this issue](https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/) 
with S3 connection pools, which leads to significant delays due to retries. 
Applying EMR's recommended fix avoids the exception but leads to a significant 
slowdown.
   
   We are using Spark 3.3 with EMR 6.9.0 across 16x i3.2xlarge workers and 1 
i3.2xlarge head node. We are using the following Spark flags as recommended by 
EMR:
   ```
   ["spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
   "spark.sql.catalog.ice=org.apache.iceberg.spark.SparkCatalog",
   "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog",
   "spark.sql.catalog.spark_catalog.type=hive",
   "spark.sql.catalog.ice.io-impl=org.apache.iceberg.aws.s3.S3FileIO"]
   ```
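
For anyone trying to reproduce this, the flag list above can be turned into `spark-submit` arguments. The following is a rough sketch, not the exact launch command used for the benchmark; the helper `to_spark_submit_args` and the script name `refresh_job.py` are hypothetical.

```python
import shlex

# The Spark flags listed in this issue, copied verbatim.
ICEBERG_FLAGS = [
    "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.ice=org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog",
    "spark.sql.catalog.spark_catalog.type=hive",
    "spark.sql.catalog.ice.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
]


def to_spark_submit_args(flags):
    """Expand each 'key=value' flag into a '--conf key=value' argument pair."""
    args = []
    for flag in flags:
        key, sep, value = flag.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {flag!r}")
        args += ["--conf", f"{key}={value}"]
    return args


# 'refresh_job.py' is a placeholder for the actual benchmark driver script.
command = ["spark-submit", *to_spark_submit_args(ICEBERG_FLAGS), "refresh_job.py"]
print(shlex.join(command))
```

Keeping the flags in one list makes it easier to run the same job against both Iceberg versions with identical settings.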
   
   Why might 1.1.0 be so much slower?


