JonasJ-ap commented on issue #6781:
URL: https://github.com/apache/iceberg/issues/6781#issuecomment-1425153055

   Some context and my thoughts here:
   
   Reference: Delta Lake's
[doc](https://docs.delta.io/latest/delta-utility.html):
   1. `VACUUM` deletes only data files, not log files.
   2. `VACUUM` is not triggered automatically; it has to be called manually.
   
   Point `1` causes an `IOException` during migration when the data files of
a snapshot that can still be reconstructed from the log have already been
cleaned up.
   Point `2` means the time of the operation is not tracked because, based on
my understanding, Delta Lake does not record `VACUUM` operations in its log.
   There are two ways to configure which files `VACUUM` treats as delete
candidates:
   1. table property: `delta.deletedFileRetentionDuration`, which defaults to
7 days
   2. a manually specified retention period: `VACUUM ... RETAIN <num> HOURS`
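   
   To make the retention rule concrete, here is a minimal sketch of the
cutoff check. It is a simplified model, not Delta Lake's implementation: the
class `VacuumRetention` and the method `isDeleteCandidate` are hypothetical
names, and the only value taken from the Delta docs is the 7-day default.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration; not part of Delta Lake or Iceberg.
final class VacuumRetention {
  // Documented default of delta.deletedFileRetentionDuration.
  static final Duration DEFAULT_RETENTION = Duration.ofDays(7);

  private VacuumRetention() {}

  // Simplified model: a file that stopped being referenced at `removedAt`
  // becomes a delete candidate once it is older than the retention period
  // at the time VACUUM runs.
  static boolean isDeleteCandidate(Instant removedAt, Instant vacuumTime, Duration retention) {
    Instant cutoff = vacuumTime.minus(retention);
    return removedAt.isBefore(cutoff);
  }
}
```

   Under this model, any snapshot that still references a file older than the
cutoff can no longer be rebuilt after `VACUUM` runs, which is the situation
that triggers the `IOException` described above.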
   
   Given these properties of `VACUUM`, it seems that whoever calls `VACUUM`
has to keep track of the earliest version that can still be time traveled to
after each run. This leads to my first proposed solution: a new feature that
is worth adding to the conversion logic. Currently, the conversion logic
starts migrating from the earliest available log version. We can add a
property that lets users set the start version, so they can skip the
snapshots whose data files have been deleted.
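   
   A minimal sketch of the start-version idea, assuming the available log
versions are already known as a list of numbers. The class
`StartVersionFilter` and the method `versionsToMigrate` are hypothetical and
do not exist in the current conversion code.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the Iceberg code base.
final class StartVersionFilter {
  private StartVersionFilter() {}

  // Keeps only the log versions at or after the user-supplied start version,
  // in ascending order, so earlier snapshots with deleted data files are
  // never touched.
  static List<Long> versionsToMigrate(List<Long> availableVersions, long startVersion) {
    return availableVersions.stream()
        .filter(version -> version >= startVersion)
        .sorted()
        .collect(Collectors.toList());
  }
}
```

   With this shape the default behavior stays the same when the start version
is set to the earliest available version.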
   
   The second solution is to catch the `IOException` when building the
`DataFile` and skip the whole snapshot if any Parquet file cannot be found.
Specifically, we should only catch the exception while no version has been
migrated yet. If some earlier snapshots were migrated successfully, the
`IOException` must be caused by something else, and we should not skip the
version because Delta logs are consecutive. My concern is that this feature
may cause an inconsistency between the user's setting and the actual result:
e.g. a user sets the starting point at version A, but the actual starting
point is moved to version B, and the user will not notice easily. My thought
is to add the actual starting version to the `Action.Result` report, or to add
another property with a name like `autoDetectStartingPoint` so that we still
throw the exception as usual unless users set it to `true`.
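   
   A rough sketch of how this second proposal could behave, combining both
safeguards. Everything here is an assumption for illustration:
`VersionMigrator` stands in for whatever builds the `DataFile`s of one
version, and `Result` is a stand-in for the real `Action.Result`, not its
actual shape.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch; none of these types exist in Iceberg under these names.
final class SnapshotMigrationSketch {
  private SnapshotMigrationSketch() {}

  // Converts one Delta version; throws IOException if a data file is missing.
  interface VersionMigrator {
    void migrate(long version) throws IOException;
  }

  // Reports where the migration actually started, so users can notice a shift.
  static final class Result {
    final long actualStartVersion; // -1 if no version was migrated
    final int migratedCount;

    Result(long actualStartVersion, int migratedCount) {
      this.actualStartVersion = actualStartVersion;
      this.migratedCount = migratedCount;
    }
  }

  static Result migrate(
      List<Long> versions, VersionMigrator migrator, boolean autoDetectStartingPoint)
      throws IOException {
    long actualStart = -1;
    int migrated = 0;
    for (long version : versions) {
      try {
        migrator.migrate(version);
      } catch (IOException e) {
        // Skip only if the user opted in and nothing has been migrated yet.
        // After the first success the log is consecutive, so a failure must
        // have another cause and is rethrown.
        if (autoDetectStartingPoint && migrated == 0) {
          continue;
        }
        throw e;
      }
      if (migrated == 0) {
        actualStart = version;
      }
      migrated++;
    }
    return new Result(actualStart, migrated);
  }
}
```

   When `autoDetectStartingPoint` is `false`, the first missing file fails the
whole action exactly as it does today; when it is `true`, the caller can
compare `actualStartVersion` with the version they asked for.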
   
   In fact, we can do both. The first proposal is a good feature to add
anyway, and the second makes the conversion logic more robust.
   
   I would like to get some feedback on these proposals before I start
implementing them. If you have any comments or a better solution, please let
me know. Thank you in advance for your help.

