Re: [I] [DISCUSSION] Project Goal [iceberg-cpp]

via GitHub Mon, 25 Nov 2024 06:42:49 -0800


Fokko commented on issue #2:
URL: https://github.com/apache/iceberg-cpp/issues/2#issuecomment-2498210503


   Hey everyone, and thanks @wgtmac for kickstarting the discussion. Sharing my 
thoughts below:
   
   ## Types
   
   I would also lean towards having a separate type system. Like @zeroshade 
already pointed out, for writing a decimal into Parquet, there are [certain 
mappings](https://github.com/apache/iceberg/blob/main/format/spec.md#parquet) 
that need to be followed according to the spec. Another issue that I ran into 
with PyIceberg is the limited support in Arrow for [Parquet 
Field-IDs](https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/src/main/thrift/parquet.thrift#L470-L473),
 which Iceberg relies on heavily. In Arrow the field-ID is 
stored as a [binary field in the 
metadata](https://github.com/apache/iceberg-python/blob/cc1ab2c224123625d74962645cfef2886bfd9718/pyiceberg/io/pyarrow.py#L995-L1000),
 and since Iceberg often traverses the schema, this would incur a lot of 
(de)serialization. Also, each field needs to track things like the initial 
default and the write default, which are not part of Arrow currently. Therefore, 
having schema primitives specifically for Iceberg makes things easier, and like 
Matt mentioned, it is easy to convert from one to the other.
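
   A minimal sketch of what Iceberg-specific schema primitives could look 
like, assuming C++17. Every name here (`NestedField`, `Schema`, `FindById`, 
`DecimalPhysicalType`) is hypothetical and not an existing iceberg-cpp API, and 
the defaults are kept as strings purely for brevity. The point is that the 
field-ID is a plain integer on the field, so traversing the schema needs no 
(de)serialization, and the decimal helper follows the precision rule from the 
Parquet section of the spec.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the iceberg-cpp API.
enum class TypeKind { kInt, kLong, kString, kDecimal };

struct IcebergType {
  TypeKind kind;
  int precision;  // only meaningful for decimals
  int scale;      // only meaningful for decimals
};

struct NestedField {
  int32_t field_id;  // first-class integer, not a binary metadata entry
  std::string name;
  IcebergType type;
  bool required;
  // Defaults are string-encoded here only to keep the sketch short.
  std::optional<std::string> initial_default;
  std::optional<std::string> write_default;
};

class Schema {
 public:
  explicit Schema(std::vector<NestedField> fields) : fields_(std::move(fields)) {}

  // Linear scan by field-ID; returns nullptr when the ID is unknown.
  const NestedField* FindById(int32_t field_id) const {
    for (const NestedField& field : fields_) {
      if (field.field_id == field_id) return &field;
    }
    return nullptr;
  }

 private:
  std::vector<NestedField> fields_;
};

enum class ParquetPhysicalType { kInt32, kInt64, kFixedLenByteArray };

// Decimal mapping from the spec: P <= 9 is int32, P <= 18 is int64,
// anything larger is a fixed-length byte array.
ParquetPhysicalType DecimalPhysicalType(int precision) {
  assert(precision >= 1 && precision <= 38);
  if (precision <= 9) return ParquetPhysicalType::kInt32;
  if (precision <= 18) return ParquetPhysicalType::kInt64;
  return ParquetPhysicalType::kFixedLenByteArray;
}
```

   Converting to and from an Arrow schema would then be a separate visitor 
over these structs, which keeps the Arrow dependency at the edge.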
   
   ## Format
   
   > data representation: row-wise (Avro record?) and columnar 
(arrow::RecordBatch?)
   
   I think we need both. Metadata is encoded in Avro, and for the data itself, 
the majority is in Parquet. Iceberg also supports Avro and ORC for storing 
data, but that's only being used by a fraction of the community.
   
   ## IO
   
   For IO there is an opinionated approach within Iceberg, called the `FileIO`: 
https://github.com/apache/iceberg/blob/f7ff0dc8c0a27e2bcd727e4f7705cf0a69ccc9b3/api/src/main/java/org/apache/iceberg/io/FileIO.java#L29-L36
   
   This is the abstraction behind all the reading that Iceberg does. One 
important distinction from a traditional filesystem is that it doesn't support 
listing or moving files, which makes it very efficient to operate on an object 
store. I think we can wrap the `arrow::FileSystem` within a FileIO (similar to 
[what we do in 
PyIceberg](https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py)),
 but I would strongly recommend also adopting this concept within Iceberg-CPP 
because it makes integration much easier. For example, it also 
[standardizes the 
configuration](https://iceberg.apache.org/docs/nightly/configuration/) across 
implementations, and people could even provide their own FileIO through 
configuration if they like.
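
   A rough sketch of how the FileIO concept might translate to C++17. The 
interface below is an assumption, not the Java API and not an existing 
iceberg-cpp API: it collapses input and output files into whole-file 
`ReadFile`/`WriteFile` calls to stay short, `InMemoryFileIO` is a toy stand-in 
for a real wrapper around `arrow::FileSystem`, and the `"in-memory"` value is 
made up. The `io-impl` key is borrowed from the catalog property that the Java 
implementation uses for the same purpose.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical interface: access by full location only.
// Deliberately no List() or Rename(), which object stores handle poorly.
class FileIO {
 public:
  virtual ~FileIO() = default;
  virtual std::string ReadFile(const std::string& location) = 0;
  virtual void WriteFile(const std::string& location, std::string contents) = 0;
  virtual void DeleteFile(const std::string& location) = 0;
};

// Toy implementation backed by a map; a real one could delegate to
// arrow::FileSystem instead.
class InMemoryFileIO : public FileIO {
 public:
  std::string ReadFile(const std::string& location) override {
    auto it = files_.find(location);
    if (it == files_.end()) throw std::out_of_range("no such file: " + location);
    return it->second;
  }

  void WriteFile(const std::string& location, std::string contents) override {
    assert(!location.empty());
    files_[location] = std::move(contents);
  }

  void DeleteFile(const std::string& location) override { files_.erase(location); }

 private:
  std::map<std::string, std::string> files_;
};

using Properties = std::map<std::string, std::string>;

// Picks an implementation from configuration. Only the made-up "in-memory"
// value is recognized in this sketch.
std::unique_ptr<FileIO> MakeFileIO(const Properties& properties) {
  auto it = properties.find("io-impl");
  if (it != properties.end() && it->second == "in-memory") {
    return std::make_unique<InMemoryFileIO>();
  }
  throw std::invalid_argument("unknown or missing io-impl");
}
```

   Because callers only ever see `FileIO`, swapping in a user-provided 
implementation is a configuration change rather than a code change.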


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

