Re: [PR] Add ManifestFile Stats in snapshot summary. [iceberg]

via GitHub Tue, 07 May 2024 10:51:00 -0700


nk1506 commented on code in PR #10246:
URL: https://github.com/apache/iceberg/pull/10246#discussion_r1592861026



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -156,6 +156,8 @@ public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
       manifests.addAll(snapshot.allManifests(ops.io()));
     }
 
+    manifests.forEach(summaryBuilder::addedManifestStats);

Review Comment:
   Thanks for the feedback.
   
   Regarding the use of manifest counts for planning, here are my thoughts:
   
   1. Having manifest counts in advance helps to plan the parallelism, as [Spark](https://github.com/apache/iceberg/blob/ed0959257cba02f378f7097d81cecaaaef9fa43f/core/src/main/java/org/apache/iceberg/BaseDistributedDataScan.java#L149) does after reading the ManifestList.
   2. How will it help with SnapshotSummary?
   
   > Engines like Spark don't get any benefit from these stats, since their parallelism is dynamic and decided at runtime.
   
   > But other engines like Dremio decide their parallelism in advance (at compile time). Providing these stats will help them choose better parallelism.
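
A rough sketch of what I mean, assuming the manifest counts end up in the snapshot summary as plain string properties. The class name, the two property keys, and the `manifestsPerTask` / `maxTasks` knobs are all hypothetical placeholders for illustration; they are not the names used in this PR or in any engine.

```java
import java.util.Map;

/**
 * Hypothetical sketch: an engine that fixes its parallelism before execution
 * could size scan planning from manifest counts in the snapshot summary,
 * without first opening the manifest list.
 */
class ManifestParallelismSketch {
  // Placeholder summary keys (assumption, not the keys defined by the PR).
  static final String DATA_MANIFESTS = "total-data-manifests";
  static final String DELETE_MANIFESTS = "total-delete-manifests";

  /** Returns a task count in the range [1, maxTasks]. */
  static int plannedParallelism(
      Map<String, String> summary, int manifestsPerTask, int maxTasks) {
    if (manifestsPerTask <= 0 || maxTasks <= 0) {
      throw new IllegalArgumentException("manifestsPerTask and maxTasks must be positive");
    }

    long manifests = count(summary, DATA_MANIFESTS) + count(summary, DELETE_MANIFESTS);
    if (manifests == 0) {
      // Stats missing or table empty: fall back to a single planning task.
      return 1;
    }

    // Ceiling division so a partial group of manifests still gets a task.
    long tasks = (manifests + manifestsPerTask - 1) / manifestsPerTask;
    return (int) Math.min(tasks, maxTasks);
  }

  // A missing key is treated as zero manifests.
  private static long count(Map<String, String> summary, String key) {
    String value = summary.get(key);
    return value == null ? 0L : Long.parseLong(value);
  }
}
```

With 10 data manifests, 2 delete manifests, and 4 manifests per task, this yields 3 planning tasks. An engine with runtime-adaptive parallelism would not need this step, which is the distinction drawn above.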
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

