szehon-ho commented on issue #7431:
URL: https://github.com/apache/iceberg/issues/7431#issuecomment-1542897084
Hi, thanks. Yes, I am comparing the two stage trees, and nothing immediately
jumps out to explain why MOR takes 40 mins and COW takes 17 mins.
Comparing the two joins,
* MOR join has shuffle bytes: 207.6 GB and 50.3 GB.
* COW has two joins (the first determines the list of files, the second is the
actual join with a filter on the file list):
  * 1st join has shuffle bytes: 10.3 GB and 7.0 GB (file projection
  calculation)
  * 2nd join has shuffle bytes: 156.9 GB and 109.3 GB (final join after file
  filter)
So just based on that, I don't see a huge difference here.
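
As a rough sanity check, the shuffle figures quoted above can be totalled per run. This is only arithmetic on the numbers in this comment; the variable names are made up for illustration, and summing both sides of each join is an assumption about how to compare them.

```python
# Shuffle bytes (GB) for both sides of each join, as quoted in this comment.
mor_joins = [(207.6, 50.3)]                  # MOR: a single join
cow_joins = [(10.3, 7.0), (156.9, 109.3)]    # COW: file-list join, then final join

def total_shuffle_gb(joins):
    """Sum both sides of every join and round to one decimal place."""
    return round(sum(left + right for left, right in joins), 1)

mor_total = total_shuffle_gb(mor_joins)
cow_total = total_shuffle_gb(cow_joins)

# Ratios: how much more COW shuffles vs. how much longer MOR runs.
shuffle_ratio = cow_total / mor_total   # COW shuffle / MOR shuffle
runtime_ratio = 40 / 17                 # MOR minutes / COW minutes

print(f"MOR total shuffle: {mor_total} GB")
print(f"COW total shuffle: {cow_total} GB")
print(f"COW/MOR shuffle ratio: {shuffle_ratio:.2f}")
print(f"MOR/COW runtime ratio: {runtime_ratio:.2f}")
```

COW shuffles slightly more in total than MOR, yet MOR takes more than twice as long, so shuffle volume alone does not appear to explain the gap.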
Also, I notice that the data for the two runs may be different: you have 785
million and 499 million rows for COW, but 261 million and 498 million for
MOR.
I am not a huge expert on the Spark UI, but is there somewhere you can see how
long each stage takes? Hopefully you don't have to re-run both jobs and can
just get it from the Spark History Server.
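
One possible way to get per-stage timings without re-running the jobs is the Spark monitoring REST API, which the History Server exposes under `/api/v1`. The sketch below is a minimal example with several assumptions: the function names are hypothetical, the timestamp format is assumed to be the `...GMT` style the API normally returns, and applications with multiple attempts may need an attempt id in the path.

```python
import json
from datetime import datetime
from urllib.request import urlopen

# Assumed timestamp layout of the REST API, e.g. "2023-05-10T10:00:00.000GMT".
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fGMT"

def fetch_stages(history_url, app_id):
    """Download the stage list for one application (hypothetical helper)."""
    url = f"{history_url}/api/v1/applications/{app_id}/stages"
    with urlopen(url) as resp:
        return json.load(resp)

def stage_durations(stages):
    """Return (stageId, name, seconds) tuples, slowest stage first.

    Stages without both a submission and a completion time
    (skipped or still running) are ignored.
    """
    rows = []
    for stage in stages:
        start = stage.get("submissionTime")
        end = stage.get("completionTime")
        if not start or not end:
            continue
        elapsed = (
            datetime.strptime(end, TIME_FORMAT)
            - datetime.strptime(start, TIME_FORMAT)
        ).total_seconds()
        rows.append((stage["stageId"], stage.get("name", ""), elapsed))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

For example, `stage_durations(fetch_stages("http://localhost:18080", "<app-id>"))` would list the slowest stages first, assuming the History Server runs on its default port 18080 and `<app-id>` is replaced with the real application id. Running this for both the MOR and COW applications should show which stages account for the difference.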