[GitHub] [iceberg-rust] ZENOTME commented on issue #34: Writer Design

via GitHub Fri, 18 Aug 2023 07:17:17 -0700


ZENOTME commented on issue #34:
URL: https://github.com/apache/iceberg-rust/issues/34#issuecomment-1683988347

> For parquet there is actually a great AsyncArrowWriter.

Good suggestion. I investigated the
[metadata](https://docs.rs/parquet/latest/parquet/format/struct.FileMetaData.html)
returned by the parquet writer and found that it can fulfill most of what a
[datafile](https://iceberg.apache.org/spec/#manifests) needs,
except for the following fields:
1. file_size_in_bytes
The metadata only records the size of each row group, but the file size also
includes the metadata size, so we may not be able to compute it from the metadata alone.
The
[solution](https://github.com/icelake-io/icelake/blob/393d000f2e952bd32045da87a0d06d66047278f0/icelake/src/io/parquet/track_writer.rs#L10)
we use in icelake is to wrap the io writer with a tracker.
2. nan_value_counts
If I understand correctly, NaN values only occur in types like float and
double, so we may need to track this count manually from the record batch.
3. split_offsets
> described in the spec: Split offsets for the data file. For example, all row
group offsets in a Parquet file. Must be sorted ascending

The metadata only records the offset of each column chunk. I'm not sure whether
we can use the first column chunk offset as the row group offset.
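
A minimal sketch of points 1 and 2, using only the standard library. `TrackWriter` and `count_nans` are hypothetical names for illustration, not the icelake or iceberg-rust API. The sketch is synchronous and built on `std::io::Write`; an async writer would need the same counting applied to its async write trait. The NaN helper takes a plain slice of optional values as a stand-in for a float column of a record batch.

```rust
use std::io::{self, Write};

/// Hypothetical wrapper: forwards every write to `inner` and counts the
/// bytes the inner writer accepted, so the final count includes the footer
/// metadata as well as the row groups.
pub struct TrackWriter<W: Write> {
    inner: W,
    bytes_written: u64,
}

impl<W: Write> TrackWriter<W> {
    pub fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }

    /// Total bytes accepted by the inner writer so far.
    pub fn bytes_written(&self) -> u64 {
        self.bytes_written
    }
}

impl<W: Write> Write for TrackWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only what the inner writer reports as written.
        let n = self.inner.write(buf)?;
        self.bytes_written += n as u64;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Hypothetical helper: counts NaN entries in one float column.
/// `None` stands for a null value and is skipped.
pub fn count_nans(values: &[Option<f64>]) -> u64 {
    values.iter().flatten().filter(|v| v.is_nan()).count() as u64
}
```

The idea would be to hand the wrapped writer to the parquet writer and read `bytes_written()` after the writer is closed to fill `file_size_in_bytes`, while calling something like `count_nans` per float or double column on each batch and summing the results for `nan_value_counts`.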

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
