ZENOTME commented on issue #34: URL: https://github.com/apache/iceberg-rust/issues/34#issuecomment-1683988347
> For parquet there is actually a great AsyncArrowWriter.

Good suggestion. I investigated the [metadata](https://docs.rs/parquet/latest/parquet/format/struct.FileMetaData.html) returned by the parquet writer and found that it can fulfill most of what [datafile](https://iceberg.apache.org/spec/#manifests) needs, except for the following fields:

1. file_size_in_bytes

   The metadata only records the size of each row group, but the file size also includes the size of the metadata itself, so we may not be able to compute it from the metadata. The [solution](https://github.com/icelake-io/icelake/blob/393d000f2e952bd32045da87a0d06d66047278f0/icelake/src/io/parquet/track_writer.rs#L10) we use in icelake is to wrap the io writer with a tracker.

2. nan_value_counts

   If I understand correctly, NaN values only occur in types like float and double, so we may need to track this information manually from the record batches.

3. split_offsets

   The spec describes it as:

   > Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending

   The metadata only records the offsets of the column chunks. I'm not sure whether we can use the first column chunk offset as the row group offset.
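For points 1 and 2, here is a minimal sketch of what I mean, using only the standard library. The names `TrackingWriter` and `nan_count` are hypothetical (this is not the icelake code), and it wraps a synchronous `std::io::Write` just to keep the example self-contained; for AsyncArrowWriter the same idea would presumably have to be applied to an async writer instead. The column is modelled as a plain `&[Option<f64>]` rather than a real Arrow array.

```rust
use std::io::{self, Write};

/// Hypothetical wrapper: counts every byte accepted by the inner writer,
/// so the total also covers the footer metadata written at the end.
pub struct TrackingWriter<W: Write> {
    inner: W,
    bytes_written: u64,
}

impl<W: Write> TrackingWriter<W> {
    pub fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }

    /// Total bytes written so far; read it after the file is closed
    /// to get a value usable as file_size_in_bytes.
    pub fn bytes_written(&self) -> u64 {
        self.bytes_written
    }
}

impl<W: Write> Write for TrackingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Only count what the inner writer actually accepted.
        let n = self.inner.write(buf)?;
        self.bytes_written += n as u64;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Counts NaN values in one float column. Nulls (None) are skipped,
/// since a null is not a NaN.
pub fn nan_count(column: &[Option<f64>]) -> u64 {
    column.iter().flatten().filter(|v| v.is_nan()).count() as u64
}
```

The counter is updated in `write` rather than derived from the metadata, so it does not depend on what the parquet writer reports. The NaN count would be accumulated per float/double column for each record batch before it is handed to the writer.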
