dannycjones commented on code in PR #2367:
URL: https://github.com/apache/iceberg-rust/pull/2367#discussion_r3312024075
##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -223,7 +221,11 @@ impl<'a> SnapshotProducer<'a> {
snapshot_id
}
- fn new_manifest_writer(&mut self, content: ManifestContentType) ->
Result<ManifestWriter> {
+ pub(crate) fn new_manifest_writer(
+ &mut self,
+ content: ManifestContentType,
+ spec_id: i32,
+ ) -> Result<ManifestWriter> {
Review Comment:
Why do we need to promote from module-private to crate-private?
If we do make it crate-private, consider adding Rustdoc?
##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -312,13 +325,69 @@ impl<'a> SnapshotProducer<'a> {
builder.build()
}
});
- let mut writer = self.new_manifest_writer(ManifestContentType::Data)?;
+ let mut writer = self.new_manifest_writer(
+ content_type,
+ self.table.metadata().default_partition_spec_id(),
+ )?;
for entry in manifest_entries {
writer.add_entry(entry)?;
}
writer.write_manifest_file().await
}
+ async fn write_deleted_manifest(
+ &mut self,
+ deleted_entries: Vec<ManifestEntry>,
+ ) -> Result<Vec<ManifestFile>> {
Review Comment:
I'd generally consider Rustdoc here even if its not `pub` since it is still
surfaced by rust-analyzer when working within the module.
##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -375,11 +457,50 @@ impl<'a> SnapshotProducer<'a> {
summary_collector.set_partition_summary_limit(partition_summary_limit);
+ // Helper: look up the partition spec for a file. Returns DataInvalid
+ // if the file references a spec that doesn't exist in the table.
+ let spec_for = |data_file: &DataFile| {
+ table_metadata
+ .partition_spec_by_id(data_file.partition_spec_id)
+ .cloned()
+ .ok_or_else(|| {
+ Error::new(
+ ErrorKind::DataInvalid,
+ format!(
+ "Cannot find partition spec {} for file: {}",
+ data_file.partition_spec_id,
+ data_file.file_path()
+ ),
+ )
+ })
+ };
+
for data_file in &self.added_data_files {
summary_collector.add_file(
data_file,
table_metadata.current_schema().clone(),
- table_metadata.default_partition_spec().clone(),
+ spec_for(data_file)?,
+ );
+ }
+ for delete_file in &self.added_delete_files {
+ summary_collector.add_file(
+ delete_file,
+ table_metadata.current_schema().clone(),
+ spec_for(delete_file)?,
+ );
+ }
+ for data_file in &self.removed_data_files {
+ summary_collector.remove_file(
+ data_file,
+ table_metadata.current_schema().clone(),
+ spec_for(data_file)?,
+ );
+ }
+ for delete_file in &self.removed_delete_files {
+ summary_collector.remove_file(
+ delete_file,
+ table_metadata.current_schema().clone(),
+ spec_for(delete_file)?,
);
}
Review Comment:
I'm doubtful if we should be using current table schema here. Why not the
schema associated with the file?
I need to read up more, but my concern is around if the partition-specific
summaries will correctly evaluate. What happens if a removed file is
partitioned using fields that are no longer in the current schema?
##########
crates/iceberg/src/transaction/snapshot.rs:
##########
@@ -223,7 +221,11 @@ impl<'a> SnapshotProducer<'a> {
snapshot_id
}
- fn new_manifest_writer(&mut self, content: ManifestContentType) ->
Result<ManifestWriter> {
+ pub(crate) fn new_manifest_writer(
+ &mut self,
+ content: ManifestContentType,
+ spec_id: i32,
+ ) -> Result<ManifestWriter> {
Review Comment:
I see that we use it here:
https://github.com/apache/iceberg-rust/pull/1606/changes#diff-1b16dcf937ab5887555a43dde25c2ad508b09a0cc38ac215f008c6d2774954d4R337-R342
However, I don't understand yet what makes this different from, for example,
an append.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]