pvary commented on code in PR #15566:
URL: https://github.com/apache/iceberg/pull/15566#discussion_r2931095026
##########
flink/v2.1/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java:
##########
@@ -626,6 +633,92 @@ public Builder setSnapshotProperty(String property, String value) {
return this;
}
+ /**
+ * Enables or disables compaction (rewriting data files) as a post-commit maintenance task.
+ *
+ * @param enabled whether to enable compaction
+ * @see RewriteDataFilesConfig for the default config.
+ * @deprecated See {@code rewriteDatafiles(..)}
+ */
+ @Deprecated
+ public Builder compaction(boolean enabled) {
+ writeOptions.put(FlinkWriteOptions.COMPACTION_ENABLE.key(), Boolean.toString(enabled));
+ return this;
+ }
+
+ /**
+ * Enables or disables compaction (rewriting data files) as a post-commit maintenance task.
+ *
+ * @param enabled whether to enable compaction
+ * @see RewriteDataFilesConfig for the default config.
+ */
+ public Builder rewriteDataFiles(boolean enabled) {
+ writeOptions.put(FlinkWriteOptions.COMPACTION_ENABLE.key(), Boolean.toString(enabled));
+ return this;
+ }
+
+ /**
+ * Enables or disables compaction (rewriting data files) as a post-commit maintenance task.
+ *
+ * @param enabled whether to enable compaction
+ * @param config task-specific configuration, see {@link RewriteDataFilesConfig} for available
+ * keys
+ */
+ public Builder rewriteDataFiles(boolean enabled, Map<String, String> config) {
+ rewriteDataFiles(enabled);
+ writeOptions.putAll(config);
+ return this;
+ }
+
+ /**
+ * Enables or disables expire snapshots as a post-commit maintenance task.
+ *
+ * @param enabled whether to enable expire snapshots
+ * @see ExpireSnapshotsConfig for the default config.
+ */
+ public Builder expireSnapshots(boolean enabled) {
+ writeOptions.put(FlinkWriteOptions.EXPIRE_SNAPSHOTS_ENABLE.key(), Boolean.toString(enabled));
Review Comment:
> Defaults are hard. These are the current defaults:
>
> ExpireSnapshotsConfig defaults:
>
> * schedule.commit-count: 10
> * schedule.data-file-count: 1,000
> * schedule.data-file-size: 100 GB
> * schedule.interval-second: 3600 (1 hour)
> * max-snapshot-age-seconds: none
> * retain-last: none
> * delete-batch-size: 1,000
> * clean-expired-metadata: false
> * planning-worker-pool-size: none (shared pool)
Can we remove `schedule.data-file-count` and `schedule.data-file-size`? I
don't think we need those for expire snapshots.
Maybe we should set `clean-expired-metadata` to true? I think keeping it off only
makes sense when we want to preserve the old functionality. We are introducing a
new one, so we should remove unused metadata as soon as possible.
> DeleteOrphanFilesConfig defaults:
>
> * schedule.commit-count: 10
> * schedule.data-file-count: 1,000
> * schedule.data-file-size: 100 GB
> * schedule.interval-second: 3600 (1 hour)
> * min-age-seconds: 259,200 (3 days)
> * delete-batch-size: 1,000
> * use-prefix-listing: false
> * planning-worker-pool-size: none (shared pool)
> * equal-schemes: none
> * equal-authorities: none
> * prefix-mismatch-mode: ERROR
Maybe we could remove all of the `schedule` configs other than the interval.
The others shouldn't have anything to do with delete orphan files.
Maybe setting `use-prefix-listing` to true would be better. It should
perform better in most cases, and only edge cases need the recursive listing.
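To make the two proposals concrete, here is a rough sketch of what the resulting defaults could look like as plain maps. The class and method names are hypothetical, the keys are copied from the lists quoted above without whatever prefix the real config classes apply, and the `true` values are only the suggestions from this comment, not current behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for the defaults proposed in this review comment.
// It is not part of Iceberg; it only illustrates which keys would remain.
class ProposedMaintenanceDefaults {

  // Expire snapshots: drop the data-file based triggers, clean metadata eagerly.
  static Map<String, String> expireSnapshots() {
    Map<String, String> defaults = new LinkedHashMap<>();
    defaults.put("schedule.commit-count", "10");
    defaults.put("schedule.interval-second", "3600");
    defaults.put("delete-batch-size", "1000");
    defaults.put("clean-expired-metadata", "true"); // proposed, currently false
    return defaults;
  }

  // Delete orphan files: keep only the interval trigger, prefer prefix listing.
  static Map<String, String> deleteOrphanFiles() {
    Map<String, String> defaults = new LinkedHashMap<>();
    defaults.put("schedule.interval-second", "3600");
    defaults.put("min-age-seconds", "259200"); // 3 days
    defaults.put("delete-batch-size", "1000");
    defaults.put("use-prefix-listing", "true"); // proposed, currently false
    defaults.put("prefix-mismatch-mode", "ERROR");
    return defaults;
  }

  public static void main(String[] args) {
    System.out.println("expire snapshots: " + expireSnapshots());
    System.out.println("delete orphan files: " + deleteOrphanFiles());
  }
}
```

Options listed as "none" above are simply left unset in this sketch.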
What do you think about these @Guosmilesmile?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]