szehon-ho commented on code in PR #11396:
URL: https://github.com/apache/iceberg/pull/11396#discussion_r1824797640
##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,12 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with any existing partition spec. |

Review Comment:
   I thought the original javadoc wording is good here as well; how about taking from that:
   ```
   Identifier of the output partition spec. Data will be reorganized during the rewrite to align with the output partitioning.
   ```

##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,12 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with any existing partition spec. |
+| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional snapshot. |

Review Comment:
   This is just my thought, but the last sentence could be clarified a little: 'This will generate an additional commit for the removal'

##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,12 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with any existing partition spec. |
+| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional snapshot. |
+!!! info

Review Comment:
   Nit: should we also add about position deletes? e.g. ', nor to position delete files containing position deletes no longer matching any live data files.'

##########
docs/docs/spark-procedures.md:
##########
@@ -393,6 +393,7 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `partial-progress.max-failed-commits` | same as `partial-progress.max-commits` | Maximum amount of failed commits that this rewrite can have before job failure, if partial progress is enabled |

Review Comment:
   I think the javadoc says 'that this rewrite can have' => 'is allowed'
   Opt: 'same as' => 'value of'? (very slightly clearer?)

##########
docs/docs/spark-procedures.md:
##########
@@ -447,9 +453,9 @@ Using the same defaults as bin-pack to determine which files to rewrite.
 CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'zorder(c1,c2)');
 ```
-Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where more than 2 or more files need to be rewritten.
+Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where at least two files need rewriting, and then remove any dangling delete files.
 ```sql
-CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2'));
+CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2','remove-dangling-deletes', 'true'));

Review Comment:
   Do you think we can add spaces throughout the whole map, for consistency?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
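For illustration, the consistently spaced options map that the last review comment asks about might be rendered like the sketch below. This is only a formatting suggestion, not part of the PR; `catalog_name` and `db.sample` are the placeholder identifiers already used in the quoted doc examples.

```sql
-- Sketch only: same procedure call as the quoted example, with a space after
-- every comma inside the options map for consistent formatting.
CALL catalog_name.system.rewrite_data_files(
  table   => 'db.sample',
  options => map('min-input-files', '2', 'remove-dangling-deletes', 'true')
);
```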