JonasJ-ap commented on issue #6781:
URL: https://github.com/apache/iceberg/issues/6781#issuecomment-1426363841

   Thank you for your suggestion. I see the point that we should make the 
migration action work out of the box. I will focus on solution 2.
   
   >  Does each log has a unique ID? Is that useable by end users? 
   
   Each delta log has a unique id called `version` associated with it. Delta 
Lake also has a public API to construct a snapshot up to a specified 
log/version. Users can also give an exact or rough timestamp and use APIs like 
[`getVersionAtOrAfterTimestamp`](https://delta-io.github.io/connectors/latest/delta-standalone/api/java/io/delta/standalone/DeltaLog.html#getVersionAtOrAfterTimestamp-long-)
 or 
[`getSnapshotForTimestampAsOf`](https://delta-io.github.io/connectors/latest/delta-standalone/api/java/io/delta/standalone/DeltaLog.html#getSnapshotForTimestampAsOf-long-)
 to retrieve the version number or the snapshot directly. Since Delta Lake 
offers many APIs to interact with the version number, I initially thought 
letting users configure the starting version might be a good idea. But now I 
agree that the starting version is better kept as an internal concept in our 
migration logic.
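
   To make the timestamp-to-version lookup concrete, here is a minimal sketch 
of the idea in plain Java. It is a toy model, not the Delta Standalone 
implementation: the class `VersionLookupSketch`, its in-memory map, and the 
choice to return an empty result when no commit matches are all assumptions 
made for illustration, and the real API may behave differently at the edges.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;

/** Toy model of a versioned log: commit timestamp (ms) -> version. Hypothetical, for illustration only. */
final class VersionLookupSketch {
    // Assumed in-memory index; a real table derives this information from its log files.
    private final TreeMap<Long, Long> versionByTimestamp = new TreeMap<>();

    /** Records that the given version was committed at the given timestamp. */
    void recordCommit(long version, long timestampMillis) {
        versionByTimestamp.put(timestampMillis, version);
    }

    /**
     * Returns the earliest version committed at or after the timestamp.
     * Returning empty when no such commit exists is a choice made for this sketch.
     */
    OptionalLong versionAtOrAfter(long timestampMillis) {
        Map.Entry<Long, Long> entry = versionByTimestamp.ceilingEntry(timestampMillis);
        return entry == null ? OptionalLong.empty() : OptionalLong.of(entry.getValue());
    }

    public static void main(String[] args) {
        VersionLookupSketch log = new VersionLookupSketch();
        log.recordCommit(0L, 1_000L);
        log.recordCommit(1L, 2_000L);
        log.recordCommit(2L, 3_000L);

        System.out.println(log.versionAtOrAfter(2_000L)); // exact commit timestamp
        System.out.println(log.versionAtOrAfter(2_500L)); // rough timestamp between commits
        System.out.println(log.versionAtOrAfter(9_999L)); // later than every commit
    }
}
```

   With a lookup like this available, the migration can pick its own starting 
version internally instead of asking the user for one.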
   
   > Is that the same experience in Databricks Delta? I think it should not be, 
because there needs to be a process to keep the delta log size short
   
   I am not quite sure about the difference between Databricks Delta and Delta 
Lake tables on other platforms (such as AWS). (My Databricks free account 
expired several weeks ago.) But as for the delta log size, I believe it is 
maintained separately and automatically by the Delta Lake table.
   
   Ref: https://docs.delta.io/latest/delta-batch.html#-data-retention
   Delta Lake logs that are older than `delta.logRetentionDuration` will be 
deleted each time a checkpoint is written (normally every 10 commits). Currently 
we rely on the 
[`getVersionAtOrAfterTimestamp`](https://delta-io.github.io/connectors/latest/delta-standalone/api/java/io/delta/standalone/DeltaLog.html#getVersionAtOrAfterTimestamp-long-)
 API to determine the earliest possible log version in the current table, 
rather than hardcoding `version 0` as we did at the early stage of that PR. So I 
think Delta's log cleanup is handled properly by this process. I will 
double-check this when doing the fix for `VACUUM` and apply similar try-catch 
logic to determine the start version if necessary.
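
   As a rough illustration of the try-catch idea, the sketch below probes 
forward from a candidate version until one can be read. Everything in it is 
hypothetical: `StartVersionSketch`, the `LogReader` interface, and the use of 
`NoSuchElementException` to signal a cleaned-up log are stand-ins chosen for 
this example, not part of Delta Lake or of the PR.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of finding the first log version that still exists after retention cleanup. */
final class StartVersionSketch {
    /** Hypothetical stand-in for reading one log version; throws if that log was removed. */
    interface LogReader {
        void read(long version);
    }

    /** Probes versions from candidate to latest (inclusive) and returns the first readable one. */
    static long firstReadableVersion(LogReader reader, long candidate, long latest) {
        for (long version = candidate; version <= latest; version++) {
            try {
                reader.read(version);
                return version;
            } catch (NoSuchElementException e) {
                // This log was removed by cleanup; move on to the next version.
            }
        }
        throw new IllegalStateException(
            "No readable log version in [" + candidate + ", " + latest + "]");
    }

    public static void main(String[] args) {
        // Assume versions 0..9 were cleaned up and only 10..12 remain.
        Set<Long> remaining = new TreeSet<>(Arrays.asList(10L, 11L, 12L));
        LogReader reader = version -> {
            if (!remaining.contains(version)) {
                throw new NoSuchElementException("log " + version + " is missing");
            }
        };
        System.out.println(firstReadableVersion(reader, 0L, 12L));
    }
}
```

   A linear probe keeps the sketch short; an actual fix could start from the 
version reported by the timestamp lookup and only fall back to probing when 
that version turns out to be unreadable.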


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

