dmgcodevil commented on PR #6174:
URL: https://github.com/apache/iceberg/pull/6174#issuecomment-1532380590

   I've found the following option:
   
   ```
  /**
   * The entire rewrite operation is broken down into pieces based on partitioning, and within
   * partitions based on size, into groups. These sub-units of the rewrite are referred to as
   * file groups. The largest amount of data that should be compacted in a single group is
   * controlled by {@link #MAX_FILE_GROUP_SIZE_BYTES}. This helps with breaking down the
   * rewriting of very large partitions which may not be rewritable otherwise due to the
   * resource constraints of the cluster. For example, a sort-based rewrite may not scale to
   * terabyte-sized partitions; those partitions need to be worked on in small subsections to
   * avoid exhaustion of resources.
   * <p>
   * When grouping files, the underlying rewrite strategy will use this value to limit the
   * files which will be included in a single file group. A group will be processed by a
   * single framework "action". For example, in Spark this means that each group would be
   * rewritten in its own Spark action. A group will never contain files for multiple output
   * partitions.
   */
  String MAX_FILE_GROUP_SIZE_BYTES = "max-file-group-size-bytes";
   ```
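   To make the grouping behavior concrete, here is a minimal sketch (not Iceberg's actual implementation) of how files within a single partition might be binned into file groups capped by `max-file-group-size-bytes`. The class and method names are hypothetical:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Hypothetical sketch: bin files (within one output partition) into file
   // groups whose total size stays at or under max-file-group-size-bytes.
   public class FileGroupSketch {
     static List<List<Long>> groupBySize(List<Long> fileSizes, long maxGroupSizeBytes) {
       List<List<Long>> groups = new ArrayList<>();
       List<Long> current = new ArrayList<>();
       long currentSize = 0;
       for (long size : fileSizes) {
         // Start a new group when adding this file would exceed the cap.
         if (!current.isEmpty() && currentSize + size > maxGroupSizeBytes) {
           groups.add(current);
           current = new ArrayList<>();
           currentSize = 0;
         }
         current.add(size);
         currentSize += size;
       }
       if (!current.isEmpty()) {
         groups.add(current);
       }
       return groups;
     }
   
     public static void main(String[] args) {
       // Four files totaling 350 units with a 200-unit cap -> two groups.
       List<List<Long>> groups = groupBySize(List.of(100L, 100L, 100L, 50L), 200L);
       System.out.println(groups.size()); // prints 2
     }
   }
   ```
   
   Each resulting group would then be handed to one framework "action" (in Spark, one rewrite action per group), which is what keeps terabyte-scale partitions from being rewritten in a single job.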
   
   However, would it make sense to limit the number of groups for compaction?
   
   cc/ @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

