jackye1995 commented on PR #7194: URL: https://github.com/apache/iceberg/pull/7194#issuecomment-1483284297
Thanks for putting this up @rajarshisarkar.

For some background, Brooklyn Data's [blog](https://brooklyndata.co/blog/benchmarking-open-table-formats) reported that Iceberg read workloads were 7x-8x slower than Delta after an UPSERT command added 92,000 small files. We reproduced the setup internally and observed a speed-up of up to 6.8x in the Iceberg read queries after combining the small files. The community also saw a 6.25x improvement in read query performance after compaction on a 25MB dataset consisting of 100,000 records in https://github.com/apache/iceberg/issues/5997.

I understand the difference is there because we want to decouple optimization from read and write, but I am curious whether we could provide some out-of-the-box optimization vendor integrations in this way through the metrics reporter, for users who do not want to use an auto-optimization solution.

@nastra @Fokko @rdblue @danielcweeks please let us know if this is something the community is interested in taking on or, if not, how we could add such integrations in a community-friendly way to close the gap in table format comparisons like this.
