hiloboy0119 opened a new pull request, #10444: URL: https://github.com/apache/iceberg/pull/10444
This PR adds a Spark micro-batch planner that reads table snapshots asynchronously in the background and fills a queue of files, which is then consumed to compute latestOffset. This lets the metadata be read and the queue fill while a micro-batch is executing. For batches with sufficiently long execution times, this makes planning effectively instantaneous (versus the minutes we observed in production).

Because the executors aren't doing any work while the driver does this planning, the default behavior greatly restricts a job's throughput, especially for tables with many files and many writers. I am running this in production and it has doubled the throughput of my jobs.

I still have a few things to do before I expect this to be acceptable to merge:

- Add back the old synchronous behavior and make sure tests cover both cases
- Clean up and format the code
- Add test cases and fix bugs I encounter while my jobs are running (I'm sure there are a few lurking, as this code is unfortunately complex)
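
For readers who want a picture of the general pattern described above, here is a minimal sketch of a background thread that plans files into a bounded queue while the consumer computes an offset from whatever is already queued. It is not the code in this PR: the class `AsyncPlannerSketch`, its methods, the use of plain strings for files, and the "offset = number of files handed out" convention are all hypothetical and chosen only for illustration. No Iceberg or Spark APIs are used.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of async planning; none of these names are Iceberg or Spark APIs.
final class AsyncPlannerSketch implements AutoCloseable {
  private final BlockingQueue<String> plannedFiles;
  private final Thread filler;
  private long consumed = 0; // assumed offset: total number of files handed out so far

  AsyncPlannerSketch(Iterator<List<String>> snapshots, int capacity) {
    BlockingQueue<String> queue = new LinkedBlockingQueue<>(capacity);
    this.plannedFiles = queue;
    // Background thread: walks snapshots in order and enqueues their files.
    this.filler = new Thread(() -> {
      try {
        while (snapshots.hasNext()) {
          for (String file : snapshots.next()) {
            queue.put(file); // blocks while the queue is full (backpressure)
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // stop filling when closed
      }
    }, "async-planner-sketch");
    this.filler.setDaemon(true);
    this.filler.start();
  }

  // Non-blocking: admits at most maxFiles files that are already planned
  // and returns the new offset. Never waits on the background thread.
  synchronized long latestOffset(int maxFiles) {
    List<String> batch = new ArrayList<>();
    plannedFiles.drainTo(batch, maxFiles);
    consumed += batch.size();
    return consumed;
  }

  // Helper for demos and tests: waits up to timeoutMillis for the filler to
  // run out of snapshots. Returns true if the filler has finished.
  boolean awaitPlanned(long timeoutMillis) {
    try {
      filler.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !filler.isAlive();
  }

  @Override
  public void close() {
    filler.interrupt();
  }
}
```

In this sketch the bounded queue keeps the background thread from planning arbitrarily far ahead, and `drainTo` returns immediately, so the offset computation costs only as much as copying the files that are already queued. A real planner would also need to handle new snapshots arriving over time, errors on the background thread, and restarts from a checkpointed offset, all of which are omitted here.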