syun64 commented on PR #582:
URL: https://github.com/apache/iceberg-python/pull/582#issuecomment-2043441654

   Hi @Fokko @adrianqin, I think the goal of this PR is to distinguish the 
semantics of a 'static overwrite' on a partitioned table from those of a 
'delete' + 'append'.
   
   As we discussed in the last [PyIceberg 
Sync](https://docs.google.com/document/d/1FnWwTJ5VBKdsvklPxt4r1Lb06DB2fFMJG3HITeafrU8/edit#heading=h.g2i9mi9crf0m),
 if we believe that a partitioned table overwrite should keep the 
expectation that the values provided in the overwrite_filter are compared 
against the values provided in the arrow table, I think we'd need these 
validations. We would first need to run them on the predicate expression 
provided in the overwrite_filter so that we can then compare its values 
against those in the arrow table. 
   
   In the community sync, we discussed whether the community was in favor of 
drawing a distinction between `delete + append` and `overwrite`, and I 
think we all gravitated _somewhat_ towards the latter.
   
   For example, where dt and level are both partition columns:
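   ```
   from datetime import date

   import pyarrow as pa

   overwrite_filter = "level = 'INFO' AND dt = '2024-03-25'"
   
   # expected behavior -> deletes partition level = 'INFO' and dt = '2024-03-25'
   
   # note: no row below falls inside the partition named by the filter
   df = pa.Table.from_pylist(
      [
          {"level": "INFO", "msg": "hi", "dt": date(2024, 3, 26)},
          {"level": "ERROR", "msg": "bye", "dt": date(2024, 3, 26)},
      ],
   )
   
   # tbl is an existing table partitioned by level and dt
   tbl.overwrite(df, overwrite_filter)
   ```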
   
   If we wanted to handle the validation only in the `delete` function by 
checking whether we would end up rewriting files, the above pattern would 
succeed: it would delete level = 'INFO' and dt = '2024-03-25', because that is 
a pure metadata operation, and then add new data files for level = 'INFO' and 
dt = '2024-03-26' and for level = 'ERROR' and dt = '2024-03-26'.
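
   To make the `delete` + `append` outcome concrete, here is a toy model in 
plain Python. It is only a sketch: the table is modelled as a dict keyed by 
partition values, the filter is reduced to a dict of column equalities, and 
`delete_then_append` is a hypothetical helper, not a PyIceberg function.

```python
from datetime import date


def delete_then_append(partitions, filter_eq, rows, partition_cols=("level", "dt")):
    """Toy model (hypothetical helper): drop partitions matching filter_eq, then append rows."""
    result = {}
    for key, existing in partitions.items():
        values = dict(zip(partition_cols, key))
        if all(values[col] == val for col, val in filter_eq.items()):
            # The whole partition matches the filter, so it is dropped
            # without rewriting any data file.
            continue
        result[key] = list(existing)
    for row in rows:
        # Appended rows land in whatever partition their own values name.
        key = tuple(row[col] for col in partition_cols)
        result.setdefault(key, []).append(row)
    return result


existing = {
    ("INFO", date(2024, 3, 25)): [
        {"level": "INFO", "msg": "old", "dt": date(2024, 3, 25)},
    ],
}
new_rows = [
    {"level": "INFO", "msg": "hi", "dt": date(2024, 3, 26)},
    {"level": "ERROR", "msg": "bye", "dt": date(2024, 3, 26)},
]
after = delete_then_append(
    existing, {"level": "INFO", "dt": date(2024, 3, 25)}, new_rows
)
```

   In this model the call succeeds without complaint: the filtered partition is 
gone and two partitions that the filter never mentioned now hold data.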
   
   Static overwrite, on the other hand, would eagerly validate the predicate 
expression against both the table schema and the values in the arrow table, 
and throw instead.
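
   A minimal sketch of what such an eager check could look like, in plain 
Python. It assumes the filter has already been reduced to a dict of column 
equalities and that the rows come from something like `df.to_pylist()`; 
`validate_static_overwrite` and its parameters are hypothetical names, not the 
validation implemented in this PR.

```python
from datetime import date


def validate_static_overwrite(filter_eq, rows, partition_cols):
    """Hypothetical check: raise ValueError if the filter and the rows disagree."""
    # Schema side: every column named in the filter must be a partition column.
    unknown = set(filter_eq) - set(partition_cols)
    if unknown:
        raise ValueError(f"filter uses non-partition columns: {sorted(unknown)}")
    # Data side: every row must carry exactly the values the filter names.
    for index, row in enumerate(rows):
        for col, expected in filter_eq.items():
            if row[col] != expected:
                raise ValueError(
                    f"row {index}: {col}={row[col]!r} does not match "
                    f"filter value {expected!r}"
                )


overwrite_eq = {"level": "INFO", "dt": date(2024, 3, 25)}
incoming_rows = [
    {"level": "INFO", "msg": "hi", "dt": date(2024, 3, 26)},
    {"level": "ERROR", "msg": "bye", "dt": date(2024, 3, 26)},
]

try:
    validate_static_overwrite(overwrite_eq, incoming_rows, ("level", "dt"))
except ValueError as exc:
    error = str(exc)  # the mismatch is reported before anything is written
else:
    error = None
```

   With the rows from the example above, the check fails on the first row 
because its dt differs from the one in the filter, so nothing would be deleted 
or appended.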


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

