dannycjones commented on code in PR #2367:
URL: https://github.com/apache/iceberg-rust/pull/2367#discussion_r3312416700
##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -375,11 +457,50 @@ impl<'a> SnapshotProducer<'a> {
summary_collector.set_partition_summary_limit(partition_summary_limit);
+ // Helper: look up the partition spec for a file. Returns DataInvalid
+ // if the file references a spec that doesn't exist in the table.
+ let spec_for = |data_file: &DataFile| {
+ table_metadata
+ .partition_spec_by_id(data_file.partition_spec_id)
+ .cloned()
+ .ok_or_else(|| {
+ Error::new(
+ ErrorKind::DataInvalid,
+ format!(
+ "Cannot find partition spec {} for file: {}",
+ data_file.partition_spec_id,
+ data_file.file_path()
+ ),
+ )
+ })
+ };
+
for data_file in &self.added_data_files {
summary_collector.add_file(
data_file,
table_metadata.current_schema().clone(),
- table_metadata.default_partition_spec().clone(),
+ spec_for(data_file)?,
+ );
+ }
+ for delete_file in &self.added_delete_files {
+ summary_collector.add_file(
+ delete_file,
+ table_metadata.current_schema().clone(),
+ spec_for(delete_file)?,
+ );
+ }
+ for data_file in &self.removed_data_files {
+ summary_collector.remove_file(
+ data_file,
+ table_metadata.current_schema().clone(),
+ spec_for(data_file)?,
+ );
+ }
+ for delete_file in &self.removed_delete_files {
+ summary_collector.remove_file(
+ delete_file,
+ table_metadata.current_schema().clone(),
+ spec_for(delete_file)?,
);
}
Review Comment:
I think fixing this will be an involved change, assuming it is wrong. Since
the stats are optional, shall we omit for now and open an issue to follow-up?
Looking at the Java implementation, [it stores a reference to the schema for
a given partition
spec](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L57).
Perhaps we should consider doing the same. This would allow us to always be
able to pull up the associated partitions spec, and I'm hoping that it would
simplify things a bit since we would not need to pass in a schema to the
snapshot summary collector.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]