saitharun15 commented on PR #10659: URL: https://github.com/apache/iceberg/pull/10659#issuecomment-2252205197
> Hi @huaxingao, @karuppayya, cc @RussellSpitzer. We were running some tests on Spark with the latest code. We took the changes from the previous PR #10288 along with the changes in this PR, generated the .stats file, and we can see the NDV values in the metadata.
>
> To verify the performance enhancement, we ran the TPC-H queries. At the 1000 GB scale we are hitting failures on certain queries (query numbers 5, 7, 8, 9, 10), where an error occurs while performing the broadcast join.
>
> I am sharing the log for query number 8 so you can check, along with the logical plan.
>
> Sharing the error log for query 5 as well.
>
> Sharing the config we used for reference:
>
> "spark.executor.cores": "6", "spark.executor.memory": "24G", "spark.driver.cores": "6", "spark.driver.memory": "24G", "spark.driver.maxResultSize": "0", "spark.sql.iceberg.enable-column-stats": "true", "spark.sql.cbo.enabled": "true", "spark.sql.cbo.joinReorder.enabled": "true"
>
> We also tried scaling the executor and driver up to 12 cores / 48 GB, but ran into the same issue.
>
> Kindly help us understand whether we are missing something, or whether this is an issue. We saw a performance enhancement in a couple of queries (1, 2, 19), where Spark was able to perform a broadcast join on larger tables, whereas previously, without NDV stats, the plans were not as good. Also, considering query 5: when run against Hive with statistics and CBO enabled, it did not fail, because Spark used a sort-merge join instead of a broadcast join for the larger tables. But with Iceberg and NDV stats, Spark is choosing only broadcast joins and failing with errors such as:
>
> org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8.0 GiB: 9.0 GiB. (org.apache.spark.sql.errors.QueryExecutionErrors$.cannotBroadcastTableOverMaxTableBytesError)
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o114.showString. : org.apache.spark.SparkUnsupportedOperationException: Can not build a HashedRelation that is larger than 8G. (org.apache.spark.sql.errors.QueryExecutionErrors$.cannotBuildHashedRelationLargerThan8GError)
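For reference, here is a minimal PySpark sketch (not part of the original comment) of the reported setup, plus a possible mitigation: lowering the broadcast thresholds so the CBO's NDV-based estimates cannot select a broadcast join for relations near Spark's hard 8 GiB HashedRelation limit. The `spark.sql.iceberg.enable-column-stats` flag is taken from the comment above; the `512m` threshold values are illustrative assumptions, not values verified against this PR.

```python
from pyspark.sql import SparkSession

# Sketch of the configuration described in the comment, with an assumed
# broadcast-threshold cap as a possible workaround for the 8 GiB limit.
spark = (
    SparkSession.builder
    .appName("tpch-iceberg-ndv-stats")
    .config("spark.executor.cores", "6")
    .config("spark.executor.memory", "24G")
    .config("spark.driver.cores", "6")
    .config("spark.driver.memory", "24G")
    .config("spark.driver.maxResultSize", "0")
    # From the PR under test: surface NDVs from the Puffin .stats file to the CBO.
    .config("spark.sql.iceberg.enable-column-stats", "true")
    .config("spark.sql.cbo.enabled", "true")
    .config("spark.sql.cbo.joinReorder.enabled", "true")
    # Assumed mitigation: keep very large relations out of broadcast joins so
    # Spark falls back to sort-merge join instead of failing at 8 GiB.
    .config("spark.sql.autoBroadcastJoinThreshold", "512m")
    .config("spark.sql.adaptive.autoBroadcastJoinThreshold", "512m")
    .getOrCreate()
)
```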