[GitHub] [iceberg] parasj commented on issue #6456: Performance regression on the TPC-DS refresh benchmark during merges with Iceberg 1.1.0 MoR versus Iceberg 0.14.0 MoR

GitBox Tue, 20 Dec 2022 09:59:03 -0800


parasj commented on issue #6456:
URL: https://github.com/apache/iceberg/issues/6456#issuecomment-1359913918


   Thanks for looking into this @singhpk234. The benchmark follows Section 5 of 
the [TPC-DS 
spec](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3.2.0.pdf).
 You most likely don't need to review it, since I can share the specific 
query that causes the issue (MERGE INTO, aka MergeIntoIcebergTable).
   
   If I use the default `fs.s3.maxConnections` value, I receive the `Timeout 
waiting for connection from pool` error. Following the [EMR 
documentation](https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/),
 I increase that value to at least 400, which resolves the error. However, task 
runtime still increases substantially on the 9th or 10th MERGE INTO iteration.
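   
   For reference, here is a minimal sketch of how that setting might be expressed as an EMR configuration object. It assumes the property is applied through the `emrfs-site` classification; the helper name `emrfs_site_config` is hypothetical and only used for illustration, and 400 is simply the value mentioned above, not a tuned recommendation.

```python
import json


def emrfs_site_config(max_connections: int) -> list:
    """Build an EMR configuration list that raises fs.s3.maxConnections.

    Assumption: the property is set via the `emrfs-site` classification.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return [
        {
            "Classification": "emrfs-site",
            # EMR configuration property values are passed as strings.
            "Properties": {"fs.s3.maxConnections": str(max_connections)},
        }
    ]


# 400 is the value from the comment above; adjust for your workload.
config = emrfs_site_config(400)
print(json.dumps(config, indent=2))
```

   The printed JSON can be supplied as the cluster's configuration when it is created. This only addresses the connection-pool timeout, not the slowdown in the later MERGE INTO iterations.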
   
   This is the query plan for the slow MERGE operation:
   
![screencapture-p-1q6rmnav5mkct-emrappui-prod-us-west-2-amazonaws-shs-history-application-1670734820778-0002-SQL-execution-2022-12-20-09_52_35](https://user-images.githubusercontent.com/453850/208733531-d8a4967c-a1aa-40d2-af4f-eb7966466972.png)
   
   Looking at the relevant job, we can see that a single worker is responsible 
for the slowdown. However, the problem reproduces consistently across many 
different EMR clusters, so it is not caused by one bad worker.
   
   
![screencapture-p-1q6rmnav5mkct-emrappui-prod-us-west-2-amazonaws-shs-history-application-1670734820778-0002-stages-stage-2022-12-20-09_54_28](https://user-images.githubusercontent.com/453850/208734202-0e051612-ac44-4347-a5dc-b1b63161c420.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

