szehon-ho commented on PR #6581:
URL: https://github.com/apache/iceberg/pull/6581#issuecomment-1387630786

   After chatting with @aokolnychyi and @RussellSpitzer, here is a guide to when 
this can be used.
   
   
   There will be two types of operations that can remove delete files:
   
   | Operation | Cost | File Type | Description |
   | --- | --- | --- | --- |
   | RemoveDanglingDeletes | Metadata-only, cost will be like querying the files/partitions table | Both | Removes position deletes with a sequence number less than the min sequence number of all data files in each partition |
   | RewritePositionDeletes | Data operation, needs to read/write all concerned delete files | Position only (equality deletes will need to be converted to position deletes) | Reads all position delete files satisfying the given filter and writes them back out, filtering out position delete entries that refer to data files that no longer exist |
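As a rough illustration of the RemoveDanglingDeletes row above, here is a minimal Python sketch of the sequence-number check. It is not the action's implementation: `FileEntry`, `find_dangling_position_deletes`, the file names and the sequence numbers are all hypothetical, and skipping partitions that have no data files is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one row of table metadata."""
    path: str
    partition: str
    sequence_number: int


def find_dangling_position_deletes(data_files, position_deletes):
    """Return position deletes older than every data file in their partition."""
    # Minimum data sequence number per partition.
    min_seq = {}
    for f in data_files:
        current = min_seq.get(f.partition)
        if current is None or f.sequence_number < current:
            min_seq[f.partition] = f.sequence_number

    dangling = []
    for d in position_deletes:
        # Assumption: partitions without any data file are skipped here.
        if d.partition in min_seq and d.sequence_number < min_seq[d.partition]:
            dangling.append(d)
    return dangling


# Hypothetical metadata: partition p=1 was fully rewritten (one new file at
# sequence 5), partition p=2 still has an old file at sequence 2.
data_files = [
    FileEntry("data-a.parquet", "p=1", 5),
    FileEntry("data-b.parquet", "p=2", 2),
    FileEntry("data-c.parquet", "p=2", 6),
]
position_deletes = [
    FileEntry("del-1.parquet", "p=1", 3),
    FileEntry("del-2.parquet", "p=2", 3),
]

for d in find_dangling_position_deletes(data_files, position_deletes):
    print(d.path)
```

In this made-up input only `del-1.parquet` qualifies. `del-2.parquet` is kept because the old `data-b.parquet` (sequence 2) holds the partition minimum below the delete file's sequence number.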
   
   Use case: RemoveDanglingDeleteFiles is cheaper, and it is the only one that 
works across both types of delete files.  However, for it to remove every 
dangling delete file, RewriteDataFiles must have been run with the following 
conditions:
   * A filter that includes entire partition(s)
   * All data files in the partition that have delete files get rewritten, i.e. 
any of these:
     * rewrite-all=true
     * delete-file-threshold=1
     * All data files happen to meet the rewrite criteria without these 
flags.
   * 'use-starting-sequence-number' must be false.  This is needed to properly 
identify old delete files as invalid using the sequence number rule.  It is only 
needed for position deletes, because equality deletes are not applied to data 
files with an equal sequence number.
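To make the 'use-starting-sequence-number' condition concrete, here is a small Python sketch of the sequence number rule: a position delete applies to data files with a sequence number less than or equal to its own, while an equality delete applies only to strictly older data files. The function names and the sequence numbers (delete files at 3, rewrite starting at 3 and committing at 4) are hypothetical and chosen only for illustration.

```python
def position_delete_applies(data_seq: int, delete_seq: int) -> bool:
    # Position deletes apply to data files with an equal or lower sequence number.
    return data_seq <= delete_seq


def equality_delete_applies(data_seq: int, delete_seq: int) -> bool:
    # Equality deletes apply only to data files with a strictly lower sequence number.
    return data_seq < delete_seq


def rewritten_sequence_number(starting_seq: int, commit_seq: int,
                              use_starting_sequence_number: bool) -> int:
    # Sequence number the rewritten data files end up with.
    return starting_seq if use_starting_sequence_number else commit_seq


DELETE_SEQ = 3    # hypothetical: delete files written at sequence 3
STARTING_SEQ = 3  # hypothetical: the rewrite started from the snapshot at sequence 3
COMMIT_SEQ = 4    # hypothetical: the rewrite committed a snapshot at sequence 4

for flag in (True, False):
    new_seq = rewritten_sequence_number(STARTING_SEQ, COMMIT_SEQ, flag)
    print(
        f"use-starting-sequence-number={flag}: data seq {new_seq}, "
        f"position delete still applies: {position_delete_applies(new_seq, DELETE_SEQ)}, "
        f"equality delete still applies: {equality_delete_applies(new_seq, DELETE_SEQ)}"
    )
```

With the flag set to true, the rewritten data file keeps sequence 3, so the old position delete still looks applicable and cannot be identified as dangling. The equality delete is inapplicable in both cases.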
   
   Note that RemoveDanglingDeleteFiles can still remove some delete files if 
these conditions are not met, but it may not remove all of them, because an old 
data file (one with a low sequence number) that was not rewritten will prevent 
delete files from being removed.
   
   So I'm open to discussion on whether there is a good use case for this.  One 
idea is to bundle it with RewriteDataFiles, and either trigger it only when 
these conditions are met, or trigger it in every case in the hope that it 
removes some delete files, since it is relatively cheap.
   
   Otherwise, the complete solution (all to be developed) would be:
   * For position deletes, run RewritePositionDeletes across all partitions.
   * For equality deletes, run ConvertToPosDeletes, then RewritePositionDeletes 
across all partitions.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

