jasonf20 opened a new pull request, #9323: URL: https://github.com/apache/iceberg/pull/9323
**Explanation**

Certain data production patterns can result in many micro-batch updates that need to be applied to the table sequentially. If these batches include deletes, they need to be committed with matching data sequence numbers so that each batch's deletes apply only to the data of previous batches. Currently, this is achievable by creating a transaction and committing each batch separately:

```
for batch in batches:
    delta = transaction.newRowDelta()
    delta.add(batch.deletes)
    delta.add(batch.inserts)
    delta.commit()
transaction.commit()
```

However, this is very slow, since it produces a manifest file for each batch and writes that file out to the filesystem. Instead, I propose an API that immediately produces a single manifest with files at different data sequence numbers (like you would get after rewriting the manifests):

```
update = table.newStreamingUpdate()
for batchIndex, batch in enumerate(batches):
    update.newBatch()
    update.add(batch.deleteFiles)
    update.add(batch.dataFiles)
update.commit()
```

The API will produce one delete-file manifest and one data-file manifest (or more if they get too large), where each batch advances the data sequence number by 1. This way:
* Deletes from previous batches don't apply to new data.
* Deletes do apply to all data written before the delete.

This PR adds this API. I will add a sample benchmark in the first comment.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
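The sequencing rule described above can be modeled with a small Python sketch. This is illustrative only, not the PR's implementation: the function names are hypothetical, batches are plain dicts, and the `data_seq < delete_seq` rule follows equality-delete semantics (a delete applies only to data with a strictly older sequence number), which is what makes each batch's deletes hit previous batches but not the data committed alongside them.

```python
def assign_sequence_numbers(batches, starting_seq=1):
    """Assign one data sequence number per batch, as the proposed API would.

    Returns (data_files, delete_files), each a list of (file_name, seq) pairs.
    """
    data_files, delete_files = [], []
    for offset, batch in enumerate(batches):
        seq = starting_seq + offset  # each batch advances the sequence number by 1
        data_files += [(f, seq) for f in batch["data"]]
        delete_files += [(f, seq) for f in batch["deletes"]]
    return data_files, delete_files


def delete_applies(delete_seq, data_seq):
    # Equality-delete rule: a delete only applies to strictly older data,
    # so deletes never affect data written in the same or a later batch.
    return data_seq < delete_seq


batches = [
    {"data": ["d1"], "deletes": []},
    {"data": ["d2"], "deletes": ["x1"]},  # x1 targets rows written in batch 1
]
data, deletes = assign_sequence_numbers(batches)
# x1 (seq 2) applies to d1 (seq 1) but not to d2 (seq 2)
```

The point of the sketch is that sequencing is per-batch, not per-commit: all files appended between two `newBatch()` calls share one sequence number, so a single commit can still encode the batch ordering.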