liurenjie1024 commented on issue #34: URL: https://github.com/apache/iceberg-rust/issues/34#issuecomment-1685599805
Thanks everyone for this nice discussion. For Parquet file metadata, I've submitted a [PR](https://github.com/apache/arrow-rs/pull/4636/files) to fix it, so we will have `file_size_in_bytes` and `split_offset` in Parquet file metadata. For metrics such as `nan_values`, I think we can follow Java's approach: https://github.com/apache/iceberg/blob/aa1c1ef49eb0cd969c18676495983f5b6a231a5c/parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java#L140

In Iceberg Java's design, there are several components:

1. [FileIO](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/api/src/main/java/org/apache/iceberg/io/FileIO.java#L32) provides methods for manipulating files in the underlying storage, such as creating a new file for writing or deleting a file. I think we can provide a similar data structure, which can be implemented as a wrapper over an underlying library (OpenDAL, etc.):

   ```rust
   impl FileIO {
       fn new_file_writer(&self, path: &str) -> impl Writer { /* ... */ }
       fn new_async_file_writer(&self, path: &str) -> impl AsyncWriter { /* ... */ }
   }
   ```

2. [FileAppender](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/api/src/main/java/org/apache/iceberg/io/FileAppender.java#L26) A file appender focuses on **one file, and the format of that file**. We can have different file appenders for different formats, such as Parquet, ORC, and Avro.
3. [FileWriter](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/core/src/main/java/org/apache/iceberg/io/FileWriter.java#L37) A `FileWriter` focuses on the content of a file; for example, we can have a data file writer, a position deletion file writer, and an equality deletion file writer. Usually a file writer cares about one partition's data.
4. [TaskWriter](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/core/src/main/java/org/apache/iceberg/io/TaskWriter.java#L31) A task writer is used by a task in a distributed computing framework such as Spark or Flink.
A task writer takes care of assigning input data to different partitions and calls the corresponding `FileWriter` to append data; for deletions it should likewise call the deletion file writers. I think the above design is quite elegant, and I would suggest building similar components here.
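To make the layering concrete, here is a minimal, hypothetical Rust sketch of how these components could compose. None of these types or signatures come from iceberg-rust or the Java code; in-memory byte buffers stand in for real storage/FileIO, and a newline-delimited text "format" stands in for Parquet/ORC/Avro, purely to show how a `TaskWriter` routes records per partition down to per-file appenders:

```rust
use std::collections::HashMap;

/// FileAppender layer: knows one file and one format.
/// Here the "format" is just newline-delimited text for illustration.
trait FileAppender {
    fn add(&mut self, record: &str);
}

struct TextAppender {
    buf: Vec<u8>, // stands in for a real storage-backed output stream
}

impl FileAppender for TextAppender {
    fn add(&mut self, record: &str) {
        self.buf.extend_from_slice(record.as_bytes());
        self.buf.push(b'\n');
    }
}

/// FileWriter layer: wraps an appender and tracks content-level concerns
/// (e.g. a data-file writer for one partition's data).
struct DataFileWriter<A: FileAppender> {
    appender: A,
    record_count: usize,
}

impl<A: FileAppender> DataFileWriter<A> {
    fn write(&mut self, record: &str) {
        self.appender.add(record);
        self.record_count += 1;
    }
}

/// TaskWriter layer: assigns each record to a partition and delegates
/// to that partition's FileWriter, creating one lazily on first use.
struct TaskWriter {
    writers: HashMap<String, DataFileWriter<TextAppender>>,
}

impl TaskWriter {
    fn new() -> Self {
        Self { writers: HashMap::new() }
    }

    fn write(&mut self, partition: &str, record: &str) {
        self.writers
            .entry(partition.to_string())
            .or_insert_with(|| DataFileWriter {
                appender: TextAppender { buf: Vec::new() },
                record_count: 0,
            })
            .write(record);
    }
}

fn main() {
    let mut task = TaskWriter::new();
    task.write("p=1", "a");
    task.write("p=2", "b");
    task.write("p=1", "c");
    // Two partitions were seen, so two file writers were created.
    assert_eq!(task.writers.len(), 2);
    assert_eq!(task.writers["p=1"].record_count, 2);
    println!("partitions: {}", task.writers.len());
}
```

In a real implementation the appender would of course write through a `FileIO`-created output stream and collect format-level metrics (like the `nan_values` counts mentioned above) at close time, but the ownership chain task writer → file writer → appender is the part this sketch tries to show.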
