saitharun15 commented on PR #10659:
URL: https://github.com/apache/iceberg/pull/10659#issuecomment-2252205197

   > Hi @huaxingao, @karuppayya, cc @RussellSpitzer. We were running some 
tests on Spark with the latest code. We applied the changes from the previous 
PR #10288 along with the changes in this PR, generated the .stats file, and 
can see the NDV values in the metadata.
   > 
   > To verify the performance improvement, we ran the TPC-H queries. When 
running at the 1000 GB scale, we are facing issues on certain queries 
(query numbers 5, 7, 8, 9, 10), where an error occurs while performing the 
broadcast join.
   > 
   > I am sharing the log for query number 8 for you to check, along with 
its logical plan.
   > 
   > [Screenshot 2024-07-26 at 1.01.19 PM: error log for query 8]
   > [Screenshot 2024-07-26 at 1.02.03 PM: logical plan for query 8]
   > 
   > Sharing the error log for query 5 as well:
   > [Screenshot 2024-07-26 at 1.05.58 PM: error log for query 5]
   > 
   > Sharing the config we used for reference:
   > 
   > "spark.executor.cores": "6",
   > "spark.executor.memory": "24G",
   > "spark.driver.cores": "6",
   > "spark.driver.memory": "24G",
   > "spark.driver.maxResultSize": "0",
   > "spark.sql.iceberg.enable-column-stats": "true",
   > "spark.sql.cbo.enabled": "true",
   > "spark.sql.cbo.joinReorder.enabled": "true"
   > 
   > We also tried scaling the executor and driver up to 12 cores / 48G of 
memory, but hit the same issue.
   > 
   > Kindly help us understand whether we are missing something, or whether 
this is a genuine issue.
   
   We saw a performance improvement in a couple of queries (1, 2, 19), where 
Spark was able to perform a broadcast join on larger tables, whereas 
previously, without NDV stats, the plans were not as good. Also, considering 
query 5: when it was run against Hive with statistics and CBO enabled, it 
did not fail, because Spark used a sort-merge join instead of a broadcast 
join for the larger tables. In the case of Iceberg with NDV stats, however, 
Spark uses only broadcast joins and fails with the following errors:
   
    org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8.0 GiB: 9.0 GiB.
    org.apache.spark.sql.errors.QueryExecutionErrors$.cannotBroadcastTableOverMaxTableBytesError
   
    py4j.protocol.Py4JJavaError: An error occurred while calling o114.showString.
    : org.apache.spark.SparkUnsupportedOperationException: Can not build a HashedRelation that is larger than 8G.
    org.apache.spark.sql.errors.QueryExecutionErrors$.cannotBuildHashedRelationLargerThan8GError
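   
   For anyone reproducing this while it is being root-caused, below is a 
minimal sketch (not part of this PR; the `tpch` catalog/table names are 
hypothetical) of one way to unblock the failing queries, assuming the 
regression is that CBO now prefers broadcast joins on tables above Spark's 
hard 8 GiB broadcast / HashedRelation limit. Capping or disabling 
`spark.sql.autoBroadcastJoinThreshold`, or forcing a sort-merge join with a 
`MERGE` hint, should make Spark fall back to the same strategy the Hive 
plan used.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tpch-broadcast-workaround")
    # Keep CBO and Iceberg column stats enabled, as in the config above.
    .config("spark.sql.cbo.enabled", "true")
    .config("spark.sql.cbo.joinReorder.enabled", "true")
    .config("spark.sql.iceberg.enable-column-stats", "true")
    # -1 disables auto-broadcast entirely; a finite cap (e.g. "512m")
    # would merely bound it. Either avoids planning a > 8 GiB broadcast.
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)

# Alternatively, override the planner per-query with a join hint
# (catalog/table names here are hypothetical):
df = spark.sql("""
    SELECT /*+ MERGE(l) */ o.o_orderkey, l.l_extendedprice
    FROM tpch.orders o
    JOIN tpch.lineitem l ON o.o_orderkey = l.l_orderkey
""")
df.explain()  # expect SortMergeJoin rather than BroadcastHashJoin
```

   This only works around the symptom, of course; the underlying question of 
why the NDV-based estimates push the planner toward broadcasting tables of 
this size still stands.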


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
