jeesou commented on PR #10659: URL: https://github.com/apache/iceberg/pull/10659#issuecomment-2252176015

Hi @huaxingao, @karuppayya (cc @RussellSpitzer),

We were running some tests on Spark with the latest code. We took the changes from the previous PR https://github.com/apache/iceberg/pull/10288 along with the changes in this PR, generated the .stats file, and can see the NDV values in the metadata. To verify the performance enhancement we ran the TPC-H queries. On running them at 1000 G scale, we are facing issues on certain queries (query numbers 5, 7, 8, 9, 10) where an error occurs while performing the broadcast join. I am sharing the log for query number 8 so you can check, along with the logical plan. Sharing the error log for query 5 as well.

Sharing the config we used for reference:
- "spark.executor.cores": "6"
- "spark.executor.memory": "24G"
- "spark.driver.cores": "6"
- "spark.driver.memory": "24G"
- "spark.driver.maxResultSize": "0"
- "spark.sql.iceberg.enable-column-stats": "true"
- "spark.sql.cbo.enabled": "true"
- "spark.sql.cbo.joinReorder.enabled": "true"

We have also tried scaling the executor and driver cores and memory up to 12 cores / 48G, but received the same issue. Kindly help us understand whether we are missing anything, or whether this is an issue.
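For completeness, here is roughly how we applied these settings when building the Spark session; this is only a minimal sketch, and anything beyond the properties quoted above (the application name, and how the Iceberg catalog itself is configured) is an assumption and may differ from your setup:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the session setup used for the TPC-H runs.
// Only the executor/driver sizing and the CBO / column-stats flags
// come from the config quoted above; the app name is illustrative.
val spark = SparkSession.builder()
  .appName("tpch-cbo-ndv-test")
  .config("spark.executor.cores", "6")
  .config("spark.executor.memory", "24G")
  .config("spark.driver.cores", "6")
  .config("spark.driver.memory", "24G")
  .config("spark.driver.maxResultSize", "0")
  .config("spark.sql.iceberg.enable-column-stats", "true")
  .config("spark.sql.cbo.enabled", "true")
  .config("spark.sql.cbo.joinReorder.enabled", "true")
  .getOrCreate()
```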
Hi @huaxingao , @karuppayya cc : @RussellSpitzer We were running some tests on Spark with the latest codes. We took the changes of the previous PR https://github.com/apache/iceberg/pull/10288, along with this PR changes, and generated the .stats file and we can see the ndv values in the metadata. So to verify the performance enhancement we ran the TPCH queries. On running the query at 1000 G scale we are facing some issues on certain queries (query umbers - 5,7,8,9,10) where while performing the broadcast join, some error occurred. I am sharing the log for query number 8 you can check. I am sharing logical plan as well.   Sharing the error log for query 5 as well  Sharing the config we used for reference - "spark.executor.cores": "6", "spark.executor.memory": "24G", "spark.driver.cores": "6", "spark.driver.memory": "24G", "spark.driver.maxResultSize":"0", "spark.sql.iceberg.enable-column-stats": "true", "spark.sql.cbo.enabled": "true", "spark.sql.cbo.joinReorder.enabled": "true", we have tried by upscaling the executor and driver cores and memoryto 12/48 scale. but received the same issue. Kindly help us understand if we are missing anything out, or is this an issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org