JonasJ-ap commented on issue #6781: URL: https://github.com/apache/iceberg/issues/6781#issuecomment-1426363841
Thank you for your suggestion. I see the point that the migration action should work out of the box, so I will focus on solution 2.

> Does each log has a unique ID? Is that useable by end users?

Each Delta log has a unique ID, called `version`, associated with it. Delta Lake also has a public API to construct a snapshot up to a specified log version. Users can also give an exact or approximate timestamp and use APIs like [`getVersionAtOrAfterTimestamp`](https://delta-io.github.io/connectors/latest/delta-standalone/api/java/io/delta/standalone/DeltaLog.html#getVersionAtOrAfterTimestamp-long-) or [`getSnapshotForTimestampAsOf`](https://delta-io.github.io/connectors/latest/delta-standalone/api/java/io/delta/standalone/DeltaLog.html#getSnapshotForTimestampAsOf-long-) to retrieve the version number or the snapshot directly. Since Delta Lake offers many APIs for working with version numbers, I initially thought letting users configure the starting version might be a good idea. But now I agree that the starting version is better kept as an internal concept in our migration logic.

> Is that the same experience in Databricks Delta? I think it should not be, because there needs to be a process to keep the delta log size short

I am not quite sure about the difference between Databricks Delta and Delta Lake tables on other platforms (such as AWS). (My Databricks free account expired several weeks ago....) But as for the delta log size, I believe it is maintained separately and automatically by the Delta Lake table. Ref: https://docs.delta.io/latest/delta-batch.html#-data-retention

Delta Lake logs older than `delta.logRetentionDuration` are deleted each time a checkpoint is written (normally every 10 commits).
Currently we rely on the [`getVersionAtOrAfterTimestamp`](https://delta-io.github.io/connectors/latest/delta-standalone/api/java/io/delta/standalone/DeltaLog.html#getVersionAtOrAfterTimestamp-long-) API to determine the earliest available log version in the current table, rather than hardcoding `version 0` as in the early stage of that PR. So I think Delta's log cleanup is handled properly by this approach. I will double-check this when doing the fix for `VACUUM` and apply similar try-catch logic to determine the start version if necessary.
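
To make the try-catch idea concrete, here is a minimal sketch of how a start version could be found when older log entries have already been cleaned up. It does not use the Delta Standalone API: `InMemoryLog`, `expireBefore`, `timestampOf`, and `StartVersionResolver` are hypothetical names made up for illustration, and the in-memory map only stands in for a real delta log. The assumption is that reading a cleaned-up version fails with an exception, so the code walks back from the latest version until a read fails.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a delta log whose old entries may have been cleaned up. */
final class InMemoryLog {
    // version -> commit timestamp in milliseconds; only retained versions are present
    private final NavigableMap<Long, Long> commits = new TreeMap<>();

    void commit(long version, long timestampMs) {
        commits.put(version, timestampMs);
    }

    /** Simulates log cleanup: drops every version committed before the cutoff. */
    void expireBefore(long cutoffMs) {
        commits.values().removeIf(ts -> ts < cutoffMs);
    }

    long latestVersion() {
        if (commits.isEmpty()) {
            throw new IllegalStateException("log is empty");
        }
        return commits.lastKey();
    }

    /** Throws if the requested version is not available in the log. */
    long timestampOf(long version) {
        Long ts = commits.get(version);
        if (ts == null) {
            throw new IllegalArgumentException("version " + version + " is not available");
        }
        return ts;
    }
}

final class StartVersionResolver {
    /** Walks back from the latest version until a read fails; returns the earliest readable version. */
    static long earliestAvailableVersion(InMemoryLog log) {
        long version = log.latestVersion();
        while (version > 0) {
            try {
                log.timestampOf(version - 1); // probe the previous version
            } catch (IllegalArgumentException e) {
                break; // the previous version was cleaned up, so stop here
            }
            version--;
        }
        return version;
    }
}
```

A linear walk is only for clarity. With a real log, a timestamp-based lookup such as the one mentioned above would avoid probing every version, with the try-catch kept as a fallback.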