liurenjie1024 commented on issue #34: URL: https://github.com/apache/iceberg-rust/issues/34#issuecomment-1685599805
Thanks everyone for this nice discussion. For Parquet file metadata, I've submitted a [PR](https://github.com/apache/arrow-rs/pull/4636/files) to fix it, so we will have `file_size_in_bytes` and `split_offset` in Parquet file metadata. For metrics such as `nan_values`, I think we can follow Java's approach: https://github.com/apache/iceberg/blob/aa1c1ef49eb0cd969c18676495983f5b6a231a5c/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java#L140

In Iceberg Java's design, there are several components:

1. [FileIO](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/api/src/main/java/org/apache/iceberg/io/FileIO.java#L32) provides methods for manipulating files in the underlying storage, such as creating a new file for writing or deleting a file. I think we can provide a similar data structure, which can be implemented as a wrapper over an underlying library (OpenDAL, etc.):

   ```rust
   impl FileIO {
       fn new_file_writer(&self, path: &str) -> impl Writer { /* ... */ }
       fn new_async_file_writer(&self, path: &str) -> impl AsyncWriter { /* ... */ }
   }
   ```

2. [FileAppender](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/api/src/main/java/org/apache/iceberg/io/FileAppender.java#L26) A file appender focuses on **one file, and the format of that file**. We can have different file appenders for different formats, such as Parquet, ORC, and Avro.
3. [FileWriter](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/core/src/main/java/org/apache/iceberg/io/FileWriter.java#L37) A `FileWriter` focuses on the content of a file; for example, we can have a data file writer, a position deletion file writer, and an equality deletion file writer. Usually a file writer cares about one partition's data.
4. [TaskWriter](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/core/src/main/java/org/apache/iceberg/io/TaskWriter.java#L31) A task writer is used by a task in a distributed computing framework such as Spark or Flink.
A task writer takes care of assigning input data to different partitions and calls the corresponding `FileWriter` to append data; for deletions it should likewise call the deletion file writers. I think the above design is quite elegant, and I would suggest building similar components here.
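To make the layering concrete, here is a minimal, hypothetical Rust sketch of how these components could compose. None of these types or signatures come from iceberg-rust or the Java code; in-memory byte buffers stand in for real storage/FileIO, and a newline-delimited text "format" stands in for Parquet/ORC/Avro, purely to show how a `TaskWriter` routes records per partition down to per-file appenders:

```rust
use std::collections::HashMap;

/// FileAppender layer: knows one file and one format.
/// Here the "format" is just newline-delimited text for illustration.
trait FileAppender {
    fn add(&mut self, record: &str);
}

struct TextAppender {
    buf: Vec<u8>, // stands in for a real storage-backed output stream
}

impl FileAppender for TextAppender {
    fn add(&mut self, record: &str) {
        self.buf.extend_from_slice(record.as_bytes());
        self.buf.push(b'\n');
    }
}

/// FileWriter layer: wraps an appender and tracks content-level concerns
/// (e.g. a data-file writer for one partition's data).
struct DataFileWriter<A: FileAppender> {
    appender: A,
    record_count: usize,
}

impl<A: FileAppender> DataFileWriter<A> {
    fn write(&mut self, record: &str) {
        self.appender.add(record);
        self.record_count += 1;
    }
}

/// TaskWriter layer: assigns each record to a partition and delegates
/// to that partition's FileWriter, creating one lazily on first use.
struct TaskWriter {
    writers: HashMap<String, DataFileWriter<TextAppender>>,
}

impl TaskWriter {
    fn new() -> Self {
        Self { writers: HashMap::new() }
    }

    fn write(&mut self, partition: &str, record: &str) {
        self.writers
            .entry(partition.to_string())
            .or_insert_with(|| DataFileWriter {
                appender: TextAppender { buf: Vec::new() },
                record_count: 0,
            })
            .write(record);
    }
}

fn main() {
    let mut task = TaskWriter::new();
    task.write("p=1", "a");
    task.write("p=2", "b");
    task.write("p=1", "c");
    // Two partitions were seen, so two file writers were created.
    assert_eq!(task.writers.len(), 2);
    assert_eq!(task.writers["p=1"].record_count, 2);
    println!("partitions: {}", task.writers.len());
}
```

In a real implementation the appender would of course write through a `FileIO`-created output stream and collect format-level metrics (like the `nan_values` counts mentioned above) at close time, but the ownership chain task writer → file writer → appender is the part this sketch tries to show.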
