jasonf20 opened a new pull request, #9323: URL: https://github.com/apache/iceberg/pull/9323
**Explanation**

Certain data production patterns can result in many micro-batch updates that need to be applied to the table sequentially. If these batches include deletes, they need to be committed with matching data sequence numbers so that each batch's deletes apply only to the data of previous batches. Currently, this is achievable by creating a transaction and committing each batch separately:

```
for batch in batches:
    delta = transaction.newRowDelta()
    delta.add(batch.deletes)
    delta.add(batch.inserts)
    delta.commit()
transaction.commit()
```

However, this is very slow, since it produces a manifest file for each batch and writes that file out to the filesystem. Instead, I propose an API that immediately produces a single manifest with files at different data sequence numbers (like you would get after rewriting the manifests):

```
update = table.newStreamingUpdate()
for batchIndex, batch in enumerate(batches):
    update.newBatch()
    update.add(batch.deleteFiles)
    update.add(batch.dataFiles)
update.commit()
```

The API will produce one delete-file manifest and one data-file manifest (or more if they get too large), where each batch advances the data sequence number by 1. This way:
* Deletes from previous batches don't apply to new data.
* Deletes do apply to all data written before the delete.

This PR adds this API. I will add a sample benchmark in the first comment.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
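The sequencing rule described above can be modeled with a small Python sketch. This is illustrative only, not the PR's implementation: the function names are hypothetical, batches are plain dicts, and the `data_seq < delete_seq` rule follows equality-delete semantics (a delete applies only to data with a strictly older sequence number), which is what makes each batch's deletes hit previous batches but not the data committed alongside them.

```python
def assign_sequence_numbers(batches, starting_seq=1):
    """Assign one data sequence number per batch, as the proposed API would.

    Returns (data_files, delete_files), each a list of (file_name, seq) pairs.
    """
    data_files, delete_files = [], []
    for offset, batch in enumerate(batches):
        seq = starting_seq + offset  # each batch advances the sequence number by 1
        data_files += [(f, seq) for f in batch["data"]]
        delete_files += [(f, seq) for f in batch["deletes"]]
    return data_files, delete_files


def delete_applies(delete_seq, data_seq):
    # Equality-delete rule: a delete only applies to strictly older data,
    # so deletes never affect data written in the same or a later batch.
    return data_seq < delete_seq


batches = [
    {"data": ["d1"], "deletes": []},
    {"data": ["d2"], "deletes": ["x1"]},  # x1 targets rows written in batch 1
]
data, deletes = assign_sequence_numbers(batches)
# x1 (seq 2) applies to d1 (seq 1) but not to d2 (seq 2)
```

The point of the sketch is that sequencing is per-batch, not per-commit: all files appended between two `newBatch()` calls share one sequence number, so a single commit can still encode the batch ordering.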