syun64 commented on PR #582: URL: https://github.com/apache/iceberg-python/pull/582#issuecomment-2043441654
Hi @Fokko @adrianqin, I think the goal of this PR is to distinguish the semantics of a 'static overwrite' on a partitioned table from those of a 'delete' + 'append'. As we discussed in the last [PyIceberg Sync](https://docs.google.com/document/d/1FnWwTJ5VBKdsvklPxt4r1Lb06DB2fFMJG3HITeafrU8/edit#heading=h.g2i9mi9crf0m), if we believe that a partitioned table overwrite should keep the expectation that the values provided in the overwrite_filter are compared to the values provided in the arrow table, I think we'd need these validations. We would first need to run these validations on the predicate expression provided in the overwrite_filter so that we can then compare its values against those in the arrow table.

In the community sync, we discussed whether the community was in favor of drawing a distinction between `delete + append` and `overwrite`, and I think we all gravitated _somewhat_ towards the latter.

For example, where dt and level are both partition columns:

```
from datetime import date

import pyarrow as pa

overwrite_filter = "level = 'INFO' AND dt = '2024-03-25'"
# expected behavior -> deletes partition level = 'INFO' and dt = '2024-03-25'
df = pa.Table.from_pylist(
    [
        {"level": "INFO", "msg": "hi", "dt": date(2024, 3, 26)},
        {"level": "ERROR", "msg": "bye", "dt": date(2024, 3, 26)},
    ],
)
# tbl is an existing table partitioned by level and dt
tbl.overwrite(df, overwrite_filter)
```

If we wanted to handle the validation only in the `delete` function, by checking whether we would end up rewriting files, the above pattern would succeed: it would delete level = 'INFO' and dt = '2024-03-25', because that is a pure metadata operation, and then add new data files for level = 'INFO' and dt = '2024-03-26' and for level = 'ERROR' and dt = '2024-03-26'. Static overwrite, on the other hand, would eagerly validate the predicate expression against the table schema and against the values in the arrow table, and throw instead.
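To make the eager check concrete, here is a rough sketch of what I have in mind. It is not PyIceberg API: `validate_static_overwrite` is a hypothetical helper, the filter is assumed to be already reduced to a dict of equality predicates, and a plain list of dicts stands in for the arrow table.

```python
from datetime import date


def validate_static_overwrite(filter_values, rows, partition_columns):
    """Hypothetical check, not part of PyIceberg.

    filter_values: equality predicates taken from overwrite_filter,
        e.g. {"level": "INFO"} (assumed simplified form).
    rows: incoming records as a list of dicts (stand-in for the arrow table).
    partition_columns: names of the table's partition columns.
    """
    # The filter may only reference partition columns.
    unknown = set(filter_values) - set(partition_columns)
    if unknown:
        raise ValueError(f"Filter references non-partition columns: {sorted(unknown)}")

    # Every incoming row must agree with the filter on each filtered column.
    for i, row in enumerate(rows):
        for column, expected in filter_values.items():
            if row[column] != expected:
                raise ValueError(
                    f"Row {i} has {column}={row[column]!r}, "
                    f"but overwrite_filter requires {column}={expected!r}"
                )


filter_values = {"level": "INFO", "dt": date(2024, 3, 25)}
rows = [
    {"level": "INFO", "msg": "hi", "dt": date(2024, 3, 26)},
    {"level": "ERROR", "msg": "bye", "dt": date(2024, 3, 26)},
]

try:
    validate_static_overwrite(filter_values, rows, partition_columns=["level", "dt"])
    outcome = "accepted"
except ValueError as exc:
    outcome = f"rejected: {exc}"
print(outcome)
```

With the data from the example, the check rejects the write before any file is touched, whereas a `delete` followed by an `append` would go through.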
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org