kevinjqliu commented on issue #1401: URL: https://github.com/apache/iceberg-python/issues/1401#issuecomment-2543350860
That makes sense to me. I think we generally need a place to replicate the [column projection logic according to the spec](https://iceberg.apache.org/spec/#column-projection). Currently, on the read path, the only projection done is to prune columns:
https://github.com/apache/iceberg-python/blob/a97d13c17cd03f86252b9df2c65532ec45fb05da/pyiceberg/io/pyarrow.py#L1246

> By comparing the projected schema vs the file projection schema

Yeah, the issue occurs when the table schema has fields that are not present in the file schema. From the spec:
```
Values for field ids which are not present in a data file must be resolved according the following rules
```

> Check if the data file partition struct contains that partition field (check by name)

We don't need this extra check, since the table/file schema mismatch will tell us which columns are missing. Also, we'd always want to check by field id. From the spec:
```
Columns in Iceberg data files are selected by field id.
```

> Try to inject this new column in the resultant RecordBatch

Yeah, we'd want to append whatever the value is to the data file records. Luckily, Arrow is columnar, so there won't be much penalty.
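To make the idea concrete, here is a rough sketch of the two steps discussed: finding the fields that are missing from the file by comparing field ids, then appending a constant column for each of them. This is not PyIceberg code. `Field`, `missing_fields`, and `project_batch` are hypothetical names, a plain dict of lists stands in for an Arrow `RecordBatch`, and the partition values are assumed to be already keyed by source field id. It only covers the identity-partition and null cases; name mapping and `initial-default` are left out.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field; not a PyIceberg class."""
    field_id: int
    name: str


def missing_fields(table_fields: List[Field], file_fields: List[Field]) -> List[Field]:
    """Return projected fields whose field id does not appear in the data file."""
    file_ids = {f.field_id for f in file_fields}
    return [f for f in table_fields if f.field_id not in file_ids]


def project_batch(
    batch: Dict[str, List[Any]],
    table_fields: List[Field],
    file_fields: List[Field],
    identity_partition_values: Dict[int, Any],
) -> Dict[str, List[Any]]:
    """Project a columnar batch onto the table schema, matching by field id.

    Assumption: `batch` maps file column names to equal-length value lists,
    and `identity_partition_values` maps field id to the partition value.
    """
    num_rows = len(next(iter(batch.values()))) if batch else 0
    file_name_by_id = {f.field_id: f.name for f in file_fields}

    result: Dict[str, List[Any]] = {}
    for field in table_fields:
        if field.field_id in file_name_by_id:
            # Present in the file: select by field id, so renames are handled.
            result[field.name] = batch[file_name_by_id[field.field_id]]
        else:
            # Missing from the file: use the partition value if there is one,
            # otherwise fall back to null (None) for every row.
            value = identity_partition_values.get(field.field_id)
            result[field.name] = [value] * num_rows
    return result


# Field 1 was renamed after the file was written; field 3 is not in the file.
table_fields = [Field(1, "id"), Field(2, "data"), Field(3, "region")]
file_fields = [Field(1, "identifier"), Field(2, "data")]
batch = {"identifier": [10, 11], "data": ["a", "b"]}

missing = missing_fields(table_fields, file_fields)
projected = project_batch(batch, table_fields, file_fields, {3: "us"})
```

Because the existing columns are reused as-is and only the missing ones are built, the extra work is proportional to the number of missing columns, which is the "not much penalty" point about a columnar layout.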