Fokko commented on PR #960:
URL: https://github.com/apache/iceberg-rust/pull/960#issuecomment-2647304167

   PyIceberg and Iceberg-Java are a bit different: where PyIceberg is used by end users, Iceberg-Java is often embedded in a query engine. I think that is why it isn't part of the transaction API. Spark does [have an `add_files`](https://iceberg.apache.org/docs/nightly/spark-procedures/#add_files) procedure.
   
   To be able to add files to a table successfully, I think three things are essential:
   
   - *Schema* As already mentioned, the schema should be either identical or compatible. I would start with requiring an identical schema to keep it simple and robust (a minimal sketch of such a check follows after this list).
   - *Name Mapping* Since the Parquet files probably don't contain field IDs for column tracking, we need to fall back on [name-mapping](https://iceberg.apache.org/spec/?column-projection#name-mapping-serialization) (an example of the serialized mapping follows below).
   - *Metrics* When adding a file to the table, we should extract the lower and upper bounds, null counts, etc. from the Parquet footer and store them in the Iceberg metadata (a sketch of reading these from the footer follows below). This is important for Iceberg to keep its promise of efficient scans; without this information, the file would always be included when planning a query.
   
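   For the strict schema check, here is a minimal sketch (assuming the `parquet` and `arrow-schema` crates; the expected schema and file path are invented for illustration) that compares the Arrow schema read from the Parquet footer against the table schema field by field:

```rust
use std::fs::File;

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical table schema: id: long (required), data: string (optional).
    let expected = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("data", DataType::Utf8, true),
    ]);

    // Only the footer of the file to be added is read; no row data is decoded.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    let actual = builder.schema();

    // Strict starting point: same field names, types, and nullability.
    let same = actual.fields().len() == expected.fields().len()
        && actual
            .fields()
            .iter()
            .zip(expected.fields().iter())
            .all(|(a, e)| {
                a.name() == e.name()
                    && a.data_type() == e.data_type()
                    && a.is_nullable() == e.is_nullable()
            });

    if !same {
        return Err("Parquet schema does not match the table schema".into());
    }
    Ok(())
}
```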
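   For the name mapping, this is roughly what the serialization stored under the `schema.name-mapping.default` table property looks like; the field IDs and column names here are made up, and the snippet just builds and prints the JSON with `serde_json`:

```rust
fn main() -> serde_json::Result<()> {
    // Illustrative name mapping: each entry maps one or more source column
    // names (as they appear in the Parquet file) to an Iceberg field ID,
    // with optional nested "fields" for struct columns.
    let name_mapping = serde_json::json!([
        { "field-id": 1, "names": ["id", "record_id"] },
        { "field-id": 2, "names": ["data"] },
        {
            "field-id": 3,
            "names": ["address"],
            "fields": [
                { "field-id": 4, "names": ["street"] },
                { "field-id": 5, "names": ["zip"] }
            ]
        }
    ]);
    println!("{}", serde_json::to_string_pretty(&name_mapping)?);
    Ok(())
}
```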
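   For the metrics, a rough sketch of pulling the row count and per-column statistics (bounds, null counts) out of the Parquet footer with the `parquet` crate; the file path is invented, and in practice these values would be aggregated into the data file entry that gets committed to the manifest:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the file and read only its footer metadata; no row data is decoded.
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let metadata = reader.metadata();

    // Record count for the whole file.
    println!("rows: {}", metadata.file_metadata().num_rows());

    // Per-column statistics (min/max bounds, null counts) from each row group.
    for row_group in metadata.row_groups() {
        for column in row_group.columns() {
            if let Some(stats) = column.statistics() {
                println!("{}: {:?}", column.column_path(), stats);
            }
        }
    }
    Ok(())
}
```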

