Fokko commented on PR #960:
URL: https://github.com/apache/iceberg-rust/pull/960#issuecomment-2647304167

   PyIceberg and Iceberg-Java are a bit different: where PyIceberg is used by end users, Iceberg-Java is often embedded in a query engine. I think that is why it isn't part of the transaction API. Spark does [have an `add_files`](https://iceberg.apache.org/docs/nightly/spark-procedures/#add_files) procedure.
   
   To be able to add files to a table successfully, I think three things are essential:
   
   - *Schema* As already mentioned, the schema should be either identical or compatible. I would start with requiring an identical schema to keep it simple and robust (a minimal sketch of such a check follows after this list).
   - *Name Mapping* Since the Parquet files probably don't contain field IDs for column tracking, we need to fall back on [name-mapping](https://iceberg.apache.org/spec/?column-projection#name-mapping-serialization) (an example of the serialized mapping follows below).
   - *Metrics* When adding a file to the table, we should extract the lower and upper bounds, null counts, etc. from the Parquet footer and store them in the Iceberg metadata (a sketch of reading these from the footer follows below). This is important for Iceberg to keep its promise of efficient scans; without this information, the file would always be included when planning a query.
   
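   For the strict schema check, here is a minimal sketch (assuming the `parquet` and `arrow-schema` crates; the expected schema and file path are invented for illustration) that compares the Arrow schema read from the Parquet footer against the table schema field by field:

```rust
use std::fs::File;

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical table schema: id: long (required), data: string (optional).
    let expected = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("data", DataType::Utf8, true),
    ]);

    // Only the footer of the file to be added is read; no row data is decoded.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    let actual = builder.schema();

    // Strict starting point: same field names, types, and nullability.
    let same = actual.fields().len() == expected.fields().len()
        && actual
            .fields()
            .iter()
            .zip(expected.fields().iter())
            .all(|(a, e)| {
                a.name() == e.name()
                    && a.data_type() == e.data_type()
                    && a.is_nullable() == e.is_nullable()
            });

    if !same {
        return Err("Parquet schema does not match the table schema".into());
    }
    Ok(())
}
```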
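   For the name mapping, this is roughly what the serialization stored under the `schema.name-mapping.default` table property looks like; the field IDs and column names here are made up, and the snippet just builds and prints the JSON with `serde_json`:

```rust
fn main() -> serde_json::Result<()> {
    // Illustrative name mapping: each entry maps one or more source column
    // names (as they appear in the Parquet file) to an Iceberg field ID,
    // with optional nested "fields" for struct columns.
    let name_mapping = serde_json::json!([
        { "field-id": 1, "names": ["id", "record_id"] },
        { "field-id": 2, "names": ["data"] },
        {
            "field-id": 3,
            "names": ["address"],
            "fields": [
                { "field-id": 4, "names": ["street"] },
                { "field-id": 5, "names": ["zip"] }
            ]
        }
    ]);
    println!("{}", serde_json::to_string_pretty(&name_mapping)?);
    Ok(())
}
```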
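   For the metrics, a rough sketch of pulling the row count and per-column statistics (bounds, null counts) out of the Parquet footer with the `parquet` crate; the file path is invented, and in practice these values would be aggregated into the data file entry that gets committed to the manifest:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the file and read only its footer metadata; no row data is decoded.
    let file = File::open("data.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let metadata = reader.metadata();

    // Record count for the whole file.
    println!("rows: {}", metadata.file_metadata().num_rows());

    // Per-column statistics (min/max bounds, null counts) from each row group.
    for row_group in metadata.row_groups() {
        for column in row_group.columns() {
            if let Some(stats) = column.statistics() {
                println!("{}: {:?}", column.column_path(), stats);
            }
        }
    }
    Ok(())
}
```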

