Fokko commented on issue #2: URL: https://github.com/apache/iceberg-cpp/issues/2#issuecomment-2498210503
Hey everyone, and thanks @wgtmac for kickstarting the discussion. Sharing my thoughts below:

## Types

I would also lean towards having a separate type system. Like @zeroshade already pointed out, when writing a decimal into Parquet, there are [certain mappings](https://github.com/apache/iceberg/blob/main/format/spec.md#parquet) that need to be followed according to the spec. Another issue that I ran into with PyIceberg is the limited support for [Parquet Field-IDs](https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/src/main/thrift/parquet.thrift#L470-L473) in Arrow, which is something that Iceberg heavily relies on. In Arrow, the field ID is stored as a [binary field in the metadata](https://github.com/apache/iceberg-python/blob/cc1ab2c224123625d74962645cfef2886bfd9718/pyiceberg/io/pyarrow.py#L995-L1000); since Iceberg often traverses the schema, this would incur a lot of (de)serialization. Also, for each field, things like the initial and write defaults need to be tracked, which is not part of Arrow currently. Therefore, having schema primitives specifically for Iceberg makes things easier and, like Matt mentioned, it is easy to convert one to the other.

## Format

> data representation: row-wise (Avro record?) and columnar (arrow::RecordBatch?)

I think we need both. Metadata is encoded in Avro, and for the data itself, the majority is in Parquet. Iceberg also supports Avro and ORC for storing data, but these are used by only a fraction of the community.

## IO

For IO there is an opinionated approach within Iceberg, called the `FileIO`: https://github.com/apache/iceberg/blob/f7ff0dc8c0a27e2bcd727e4f7705cf0a69ccc9b3/api/src/main/java/org/apache/iceberg/io/FileIO.java#L29-L36

This implements all the reading that is used by Iceberg. One important distinction from a traditional filesystem is that it doesn't support listing or moving files, which makes it very efficient to operate on an object store.
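To make the shape of that interface a bit more concrete, here is a minimal sketch of what a `FileIO`-style abstraction could look like in C++. All names here (`FileIO`, `ReadFile`, `WriteFile`, `DeleteFile`, `InMemoryFileIO`, `RoundTrip`) are hypothetical and not an existing iceberg-cpp API. The Java interface hands out input and output file objects, whereas this sketch collapses them into whole-object reads and writes for brevity. The point it illustrates is only that there is no list and no rename operation.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical C++ shape of the FileIO idea: read, write, and delete by
// location. There is intentionally no list and no rename operation.
class FileIO {
 public:
  virtual ~FileIO() = default;
  virtual std::string ReadFile(const std::string& location) const = 0;
  virtual void WriteFile(const std::string& location,
                         const std::string& contents) = 0;
  virtual void DeleteFile(const std::string& location) = 0;
};

// Toy in-memory implementation, only meant for illustration and tests.
class InMemoryFileIO : public FileIO {
 public:
  std::string ReadFile(const std::string& location) const override {
    auto it = objects_.find(location);
    if (it == objects_.end()) {
      throw std::runtime_error("no such object: " + location);
    }
    return it->second;
  }

  void WriteFile(const std::string& location,
                 const std::string& contents) override {
    assert(!location.empty());       // a location is always required
    objects_[location] = contents;   // whole-object put, like an object store
  }

  void DeleteFile(const std::string& location) override {
    objects_.erase(location);
  }

 private:
  std::map<std::string, std::string> objects_;
};

// Callers depend only on the interface, so any backend can be swapped in.
inline std::string RoundTrip(FileIO& io, const std::string& location,
                             const std::string& contents) {
  io.WriteFile(location, contents);
  return io.ReadFile(location);
}
```

Because callers only ever address files by full location, a backend never has to emulate directory listings or renames, which object stores typically do not offer as cheap operations.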
I think we can wrap the `arrow::FileSystem` within a FileIO (similar to [what we do in PyIceberg](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py)), but I would strongly recommend also adopting this concept within Iceberg-CPP because it makes the integration much easier. For example, it also [standardizes the configuration](https://iceberg.apache.org/docs/nightly/configuration/) across implementations, and people could even provide their own FileIO through configuration if they like.
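Coming back to the Types point, here is a rough sketch of what an Iceberg-native field could carry, together with the decimal-to-Parquet mapping from the spec (precision up to 9 as `int32`, up to 18 as `int64`, otherwise a fixed-length byte array of the minimum size that holds the precision). The struct and function names (`SchemaField`, `DecimalType`, `ParquetDecimal`, `ToParquet`) are made up for illustration, and the defaults are kept as plain strings only to keep the sketch short.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical Iceberg-side decimal type, kept separate from Arrow's types.
struct DecimalType {
  int precision;
  int scale;
};

// Hypothetical Iceberg-side field: the field ID and both defaults are plain
// members, so traversing a schema needs no metadata (de)serialization.
struct SchemaField {
  int32_t field_id;
  std::string name;
  bool required;
  DecimalType type;  // only decimal in this sketch
  std::optional<std::string> initial_default;
  std::optional<std::string> write_default;
};

enum class ParquetPhysicalType { kInt32, kInt64, kFixedLenByteArray };

struct ParquetDecimal {
  ParquetPhysicalType physical_type;
  int fixed_length;  // bytes; 0 unless physical_type is kFixedLenByteArray
};

// Decimal mapping as described in the Parquet section of the Iceberg spec.
inline ParquetDecimal ToParquet(const DecimalType& d) {
  if (d.precision < 1 || d.precision > 38) {
    throw std::invalid_argument("decimal precision must be in [1, 38]");
  }
  if (d.precision <= 9) return {ParquetPhysicalType::kInt32, 0};
  if (d.precision <= 18) return {ParquetPhysicalType::kInt64, 0};
  // Smallest n with 2^(8n - 1) >= 10^precision (two's complement, signed).
  const double bits = d.precision * std::log2(10.0) + 1.0;
  return {ParquetPhysicalType::kFixedLenByteArray,
          static_cast<int>(std::ceil(bits / 8.0))};
}
```

Keeping this logic next to the Iceberg types means that a conversion to Arrow only has to happen at the boundary, while spec-specific rules like the one above stay in one place.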