szehon-ho commented on code in PR #11396:
URL: https://github.com/apache/iceberg/pull/11396#discussion_r1824947630
##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,13 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
 | `rewrite-all` | false | Force rewriting of all provided files overriding other options |
 | `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
 | `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Identifier of the output partition spec. Data will be reorganized during the rewrite to align with the output partitioning |

Review Comment:
   Nit: this doc looks inconsistent, but I guess we can do one space to keep with the majority?

##########
docs/docs/spark-procedures.md:
##########
@@ -393,6 +393,7 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
 | `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
 | `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
 | `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `partial-progress.max-failed-commits` | value of `partital-progress.max-commits` | Maximum amount of failed commits is allowed before job failure, if partial progress is enabled |

Review Comment:
   Sorry, maybe my earlier comment was the mistake. Let's remove "is"; I think it came from the Javadoc, but it doesn't make sense here.

##########
docs/docs/spark-procedures.md:
##########
@@ -402,7 +403,13 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
 | `rewrite-all` | false | Force rewriting of all provided files overriding other options |
 | `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
 | `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Identifier of the output partition spec. Data will be reorganized during the rewrite to align with the output partitioning |
+| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional commit for the removal |

Review Comment:
   Nit: add a period at the end.
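For context, a minimal sketch of how the options discussed above might be passed to `rewrite_data_files`, following the `CALL ... options => map(...)` pattern already used elsewhere on this docs page. The catalog name, table name, and option values below are placeholders for illustration, not part of this PR:

```sql
-- Sketch only: catalog/table names and option values are placeholders.
-- Compact db.sample with partial progress enabled, tolerating up to 3 failed
-- commits, and drop any delete files left dangling after the rewrite.
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map(
    'partial-progress.enabled', 'true',
    'partial-progress.max-failed-commits', '3',
    'remove-dangling-deletes', 'true'
  )
);
```

Note that, per the `remove-dangling-deletes` description above, enabling it produces an additional commit for the removal, so it interacts with the `partial-progress.max-commits` budget when partial progress is enabled.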