[PR] [WIP]Spark: Asynchronous Spark Micro Batch Planner [iceberg]

via GitHub Tue, 04 Jun 2024 13:33:58 -0700


hiloboy0119 opened a new pull request, #10444:
URL: https://github.com/apache/iceberg/pull/10444


   This PR adds a Spark micro-batch planner that reads table snapshots 
asynchronously in the background and fills a queue of files, which is then 
consumed to compute latestOffset.  Metadata can therefore be read, and the 
queue filled, while a micro-batch is executing.  For batches with sufficiently 
long execution times this makes planning effectively instantaneous (versus the 
minutes we observed in production).  Because the executors do no work while 
the driver plans, the default synchronous behavior greatly restricts a job's 
throughput (especially for tables with many files and many writers).  I am 
running this in production and it has doubled the throughput of my jobs.
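
   The idea described above can be sketched roughly as follows.  This is a 
minimal illustration in Python, not the PR's Java implementation: the class 
`AsyncPlanner`, its methods, the `(snapshot_id, position)` offset shape, and 
the in-memory snapshot list standing in for table metadata reads are all 
hypothetical names chosen for the example.

```python
import queue
import threading


class AsyncPlanner:
    """Hypothetical sketch: a background thread plans files into a queue."""

    def __init__(self, snapshots, max_queued_files=1000):
        # snapshots: iterable of (snapshot_id, [file names]); an assumption
        # standing in for the table metadata a real planner would read.
        self._snapshots = snapshots
        self._queue = queue.Queue(maxsize=max_queued_files)
        self._thread = threading.Thread(target=self._fill, daemon=True)

    def start(self):
        self._thread.start()

    def _fill(self):
        # Runs in the background, e.g. while a micro-batch is executing.
        for snapshot_id, files in self._snapshots:
            for position, name in enumerate(files):
                # put() blocks when the queue is full, bounding memory use.
                self._queue.put((snapshot_id, position, name))

    def wait_until_planned(self, timeout=None):
        # Only needed for deterministic demos and tests.
        self._thread.join(timeout)

    def latest_offset(self, max_files):
        """Drain up to max_files already-planned files without blocking."""
        batch = []
        while len(batch) < max_files:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return None, []
        snapshot_id, position, _ = batch[-1]
        # The offset points just past the last file included in this batch.
        return (snapshot_id, position + 1), [name for _, _, name in batch]
```

   Because `latest_offset` only drains what has already been queued, its cost 
does not depend on how long the metadata reads take, which is the property the 
description attributes to the asynchronous planner.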
   
   I still have a few things to do before I expect this to be ready to 
merge:
   
   - Restore the old synchronous behavior and make sure tests cover both 
cases
   - Clean up and format the code
   - Add test cases and fix the bugs I encounter while my jobs are running 
(I'm sure a few are lurking, as this code is unfortunately complex)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


