dentiny opened a new issue, #1284: URL: https://github.com/apache/iceberg-rust/issues/1284
### What's the feature are you trying to implement?

People commonly ask to upload or export their existing Parquet files into an Iceberg table as its initial version, either as a shallow copy (i.e. no file movement or copy) or as a deep copy.

One related issue I could find: pg_mooncake mentions a form of shallow copy on their roadmap: https://github.com/Mooncake-Labs/pg_mooncake/issues/61

Right now, I don't find this feature supported by [`DataFileWriter`](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/writer/base_writer/data_file_writer.rs), so my workaround is to read the Parquet files into Arrow record batches and write them back out with the Parquet file writer, one by one.

I expect to have two interfaces in iceberg-rust:

```rust
/// Shallow copy parquet files into iceberg table, and return data files to reference in iceberg table.
pub fn reference_parquet_files(parquet_filepaths: Vec<String>) -> Vec<DataFile>;

/// Deep copy parquet files into iceberg table with the given target locations, and return data files to reference in iceberg table.
pub fn copy_parquet_files(src_parquet_filepaths: Vec<String>, dst_parquet_filepaths: Vec<String>) -> Vec<DataFile>;
```

There are two benefits here:
- It saves the Arrow batch (de)serialization and IO operations; only metadata interpretation is needed
  + For example, stats like `null_value_counts` are already contained in the Parquet page header
- It gives users a high-level, user-friendly interface for exporting Parquet files

I tried to read the Parquet metadata and copy the files via opendal myself, but there are two blockers (for me):
- Interpreting Parquet metadata into an Iceberg `DataFile` requires a certain amount of work
- All fields in `DataFile` are only visible inside the crate

### Willingness to contribute

I would be willing to contribute to this feature with guidance from the Iceberg Rust community
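To make the first blocker more concrete, here is a rough sketch of the kind of aggregation that turning Parquet footer metadata into data file statistics would involve. Everything in it is an assumption for illustration: `ColumnChunkStats`, `RowGroupStats`, `DataFileSummary`, and `summarize` are hypothetical stand-ins, not types or functions from iceberg-rust or the `parquet` crate, and the sketch assumes the Parquet columns already carry Iceberg field ids. It only covers record count, per-column sizes, and null counts.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical mirror of the statistics one Parquet column chunk carries.
struct ColumnChunkStats {
    field_id: i32,           // assumed to match the Iceberg field id
    compressed_size: u64,    // bytes occupied by this chunk
    null_count: Option<u64>, // Parquet statistics are optional
}

/// Hypothetical mirror of one Parquet row group.
struct RowGroupStats {
    num_rows: u64,
    columns: Vec<ColumnChunkStats>,
}

/// Hypothetical subset of what an Iceberg data file entry records.
#[derive(Debug, Default)]
struct DataFileSummary {
    file_path: String,
    record_count: u64,
    column_sizes: HashMap<i32, u64>,
    null_value_counts: HashMap<i32, u64>,
}

/// Fold per-row-group statistics into file-level statistics.
fn summarize(file_path: &str, row_groups: &[RowGroupStats]) -> DataFileSummary {
    let mut summary = DataFileSummary {
        file_path: file_path.to_string(),
        ..Default::default()
    };
    // Columns where at least one chunk has no null count: the total is unknown.
    let mut unknown_nulls: HashSet<i32> = HashSet::new();

    for rg in row_groups {
        summary.record_count += rg.num_rows;
        for col in &rg.columns {
            *summary.column_sizes.entry(col.field_id).or_insert(0) += col.compressed_size;
            match col.null_count {
                Some(n) => *summary.null_value_counts.entry(col.field_id).or_insert(0) += n,
                None => {
                    unknown_nulls.insert(col.field_id);
                }
            }
        }
    }

    // Drop partial totals rather than report an undercount.
    summary
        .null_value_counts
        .retain(|id, _| !unknown_nulls.contains(id));
    summary
}

fn main() {
    let row_groups = vec![RowGroupStats {
        num_rows: 100,
        columns: vec![ColumnChunkStats {
            field_id: 1,
            compressed_size: 4096,
            null_count: Some(3),
        }],
    }];
    let summary = summarize("s3://bucket/data/a.parquet", &row_groups);
    println!("{:?}", summary);
}
```

A real implementation would also need lower/upper bounds, the file size, partition values, and a fallback for files whose columns have no field ids, which is part of why I'd rather have this live inside the crate.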