Re: [PR] Add ManifestFile Stats in snapshot summary. [iceberg]

via GitHub Mon, 12 Aug 2024 05:15:25 -0700


Fokko commented on code in PR #10246:
URL: https://github.com/apache/iceberg/pull/10246#discussion_r1713644175



##########
core/src/main/java/org/apache/iceberg/FastAppend.java:
##########
@@ -156,6 +156,8 @@ public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
       manifests.addAll(snapshot.allManifests(ops.io()));
     }
 
+    manifests.forEach(summaryBuilder::addedManifestStats);

Review Comment:
   @ajantha-bhat I still think the argument that this information helps query 
planning is quite thin. I don't think you can avoid reading the manifest list 
if you want to do meaningful query planning, since the size of the manifests 
varies wildly. Another issue is that a manifest may not be live, meaning that 
it contains only deleted manifest entries. You get all of this information 
when you read the manifest list.
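   
   To make the point concrete, here is a minimal sketch, not Iceberg code. 
`ManifestInfo`, its fields, and the `isLive()` rule are hypothetical stand-ins 
for what a manifest-list entry records (manifest length plus added, existing, 
and deleted file counts), and the sizes are made-up numbers. It only 
illustrates how a raw manifest count can differ from what a planner would 
actually have to read.

```java
import java.util.List;

public class ManifestCostSketch {

  /** Hypothetical stand-in for one manifest-list entry; not an Iceberg class. */
  public static final class ManifestInfo {
    final long lengthBytes;
    final int addedFiles;
    final int existingFiles;
    final int deletedFiles;

    public ManifestInfo(long lengthBytes, int addedFiles, int existingFiles, int deletedFiles) {
      this.lengthBytes = lengthBytes;
      this.addedFiles = addedFiles;
      this.existingFiles = existingFiles;
      this.deletedFiles = deletedFiles;
    }

    /** Assumed definition of "live": at least one added or existing entry. */
    boolean isLive() {
      return addedFiles > 0 || existingFiles > 0;
    }
  }

  /** Number of manifests that still contain live entries. */
  public static long liveCount(List<ManifestInfo> manifests) {
    return manifests.stream().filter(ManifestInfo::isLive).count();
  }

  /** Total bytes of the manifests that still contain live entries. */
  public static long liveBytes(List<ManifestInfo> manifests) {
    return manifests.stream()
        .filter(ManifestInfo::isLive)
        .mapToLong(m -> m.lengthBytes)
        .sum();
  }

  public static void main(String[] args) {
    // Made-up example: one large manifest, one tiny one, one with only deletes.
    List<ManifestInfo> manifests = List.of(
        new ManifestInfo(8L * 1024 * 1024, 100, 0, 0),
        new ManifestInfo(64L * 1024, 0, 2, 0),
        new ManifestInfo(512L * 1024, 0, 0, 50));

    System.out.println("total manifests: " + manifests.size());
    System.out.println("live manifests:  " + liveCount(manifests));
    System.out.println("live bytes:      " + liveBytes(manifests));
  }
}
```

   In this made-up example a count-only summary would report three manifests, 
even though one of them holds no live entries and the two live ones differ in 
size by more than two orders of magnitude.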
   
   > Agree that having size-based cost estimation will be more accurate. But 
count-based estimation is still better than no stats.
   
   Everything comes at a price. The snapshots are already a substantial portion 
of the table metadata, and users are already running into issues when the 
number of snapshots becomes too large.
   
   Looping in @aokolnychyi as well to get his opinion, since he has done a 
lot of work on performance optimization.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

