singhpk234 commented on code in PR #11396: URL: https://github.com/apache/iceberg/pull/11396#discussion_r1817628531
##########
docs/docs/spark-procedures.md:
##########

@@ -402,7 +403,8 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
 | `rewrite-all` | false | Force rewriting of all provided files overriding other options |
 | `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
 | `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
-
+| `output-spec-id` | current partition spec id | Desired partition spec ID to be used for rewriting data files. This allows data files to be rewritten with one of existing partition specs. |
+| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional snapshot of the delete type. |

Review Comment:
   > Enabling this will generate an additional snapshot of the delete type.

   1/ If this commit fails, does the whole job fail?
   2/ Can you please elaborate on what `snapshot of the delete type` means? From what I understand, the operation of this update should be `replace`, since we use the `table.newRewrite()` API.

Review Comment:
   There are still limitations to the removal of dangling deletes, right? Not all dangling deletes can be removed, per the limitations section here: https://docs.google.com/document/d/11d-cIUR_89kRsMmWnEoxXGZCvp7L4TUmPJqUC60zB5M/edit?tab=t.0#heading=h.csuw3vlddikv

   Wondering if it's worth calling that out ~
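For context on the options under review, a minimal sketch of how `remove-dangling-deletes` and `output-spec-id` would be passed to the `rewrite_data_files` Spark procedure documented in this file; `catalog_name`, `db.sample`, and the spec id `1` are placeholder values, not taken from this PR:

```sql
-- Hedged sketch: compact data files and clean up dangling deletes in one call.
-- `catalog_name` and `db.sample` are illustrative names, not from this PR.
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map(
    'remove-dangling-deletes', 'true',  -- per the new docs row: adds a follow-up snapshot removing dangling delete files
    'output-spec-id', '1'               -- per the new docs row: rewrite into an existing partition spec (id is illustrative)
  )
);
```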