liurenjie1024 commented on code in PR #960:
URL: https://github.com/apache/iceberg-rust/pull/960#discussion_r1959197859


##########
crates/iceberg/src/transaction.rs:
##########
@@ -205,6 +252,64 @@ impl<'a> FastAppendAction<'a> {
         Ok(self)
     }
 
+    /// Adds existing parquet files
+    pub async fn add_parquet_files(
+        transaction: Transaction<'a>,
+        file_paths: Vec<String>,

Review Comment:
   ```suggestion
           &mut self,
           file_path: &str
   ```
   
   As an API, I would suggest asking the user to provide one file path per call, so that users can add files concurrently.
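   A minimal, dependency-free sketch of the suggested shape (the type and field names here are hypothetical; the real method would be async and live on `FastAppendAction`): one file path per call, taking `&mut self`, so callers can loop or fan the calls out concurrently.

   ```rust
   // Hypothetical stand-in for FastAppendAction, illustrating the suggested
   // one-file-per-call API shape: `&mut self` plus a single `&str` path.
   #[derive(Default)]
   struct FastAppendSketch {
       added_files: Vec<String>,
   }

   impl FastAppendSketch {
       /// Adds one existing parquet file; returns `&mut Self` so calls can chain.
       fn add_parquet_file(&mut self, file_path: &str) -> Result<&mut Self, String> {
           if self.added_files.iter().any(|f| f == file_path) {
               return Err(format!("file already added: {file_path}"));
           }
           self.added_files.push(file_path.to_string());
           Ok(self)
       }
   }
   ```

   Each call stands alone, so a caller holding several paths can issue the calls in a loop or from separate tasks rather than building one big `Vec<String>` up front.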



##########
crates/iceberg/src/writer/file_writer/parquet_writer.rs:
##########
@@ -219,14 +229,15 @@ pub struct ParquetWriter {
 }
 
 /// Used to aggregate min and max value of each column.
-struct MinMaxColAggregator {
+pub struct MinMaxColAggregator {

Review Comment:
   We don't need to make this public?



##########
crates/iceberg/src/transaction.rs:
##########
@@ -205,6 +252,64 @@ impl<'a> FastAppendAction<'a> {
         Ok(self)
     }
 
+    /// Adds existing parquet files
+    pub async fn add_parquet_files(
+        transaction: Transaction<'a>,
+        file_paths: Vec<String>,
+    ) -> Result<Transaction<'a>> {
+        // Checks duplicate files
+        let new_files: HashSet<&str> = file_paths.iter().map(|s| s.as_str()).collect();
+
+        let mut manifest_stream = transaction.table.inspect().manifests().scan().await?;
+        let mut referenced_files = Vec::new();
+
+        while let Some(batch) = manifest_stream.try_next().await? {
+            let file_path_array = batch
+                .column(1)
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .ok_or_else(|| {
+                    Error::new(
+                        ErrorKind::DataInvalid,
+                        "Failed to downcast file_path column to StringArray",
+                    )
+                })?;
+
+            for i in 0..batch.num_rows() {
+                let file_path = file_path_array.value(i);
+                if new_files.contains(file_path) {
+                    referenced_files.push(file_path.to_string());
+                }
+            }
+        }
+
+        if !referenced_files.is_empty() {
+            return Err(Error::new(
+                ErrorKind::DataInvalid,
+                format!(
+                    "Cannot add files that are already referenced by table, files: {}",
+                    referenced_files.join(", ")
+                ),

Review Comment:
   We should not check for duplicate files here; we should do it in `FastAppendOperation`'s apply method. Otherwise we would need to load all manifest files every time the user calls this method.
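   As a sketch of where that check could live (function and variable names are hypothetical, standard library only): the apply step would load the file paths already referenced by the table's manifests once, then validate all pending files in a single pass.

   ```rust
   use std::collections::HashSet;

   // Sketch of a duplicate check performed once, at apply time, against the set
   // of file paths already referenced by the table's manifests (loaded a single
   // time when the snapshot is produced, not on every add call).
   fn check_duplicate_files(
       referenced: &HashSet<String>,
       new_files: &[String],
   ) -> Result<(), String> {
       let dups: Vec<&str> = new_files
           .iter()
           .filter(|f| referenced.contains(*f))
           .map(|f| f.as_str())
           .collect();
       if dups.is_empty() {
           Ok(())
       } else {
           Err(format!(
               "Cannot add files that are already referenced by table, files: {}",
               dups.join(", ")
           ))
       }
   }
   ```

   This keeps the manifest scan off the hot path: adding N files costs one scan at commit time instead of N scans.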



##########
crates/iceberg/src/writer/file_writer/parquet_writer.rs:
##########
@@ -391,6 +404,79 @@ impl ParquetWriter {
             );
         Ok(builder)
     }
+
+    /// `ParquetMetadata` to data file builder
+    pub fn parquet_to_data_file_builder(
+        schema: SchemaRef,
+        metadata: Arc<ParquetMetaData>,

Review Comment:
   I don't get why we need to duplicate so much code with 
`ParquetWriter::to_data_file_builder`, how about simply extracting 
`FileMetaData` from `ParquetMetaData` and calling `to_data_file_builder`?



##########
crates/iceberg/src/writer/file_writer/mod.rs:
##########
@@ -24,7 +24,7 @@ use super::CurrentFileStatus;
 use crate::spec::DataFileBuilder;
 use crate::Result;
 
-mod parquet_writer;
+pub mod parquet_writer;

Review Comment:
   Why we need to expose this? I think `ParquetWriter` has already been exposed.
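   If the only goal is to name `ParquetWriter` from outside, a `pub use` re-export keeps the module itself private. A self-contained sketch of the pattern (the struct body is a placeholder):

   ```rust
   mod file_writer {
       // The module itself stays private...
       mod parquet_writer {
           pub struct ParquetWriter {
               pub path: String,
           }
       }
       // ...while the type is re-exported at the parent level, so callers can
       // write `file_writer::ParquetWriter` without `pub mod parquet_writer`.
       pub use parquet_writer::ParquetWriter;
   }
   ```

   This matches the existing `mod parquet_writer;` layout: the type is reachable, but the module's internals stay out of the public API surface.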



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

