Fokko commented on PR #582:
URL: https://github.com/apache/iceberg-python/pull/582#issuecomment-2042152141

   Hi Adrian, thanks for working on this and for the very comprehensive write-up. My first question is: what is the main goal of this PR?
   
   Let me elaborate with more context. Looking at the Spark syntax, it originated in the Hive era and assumes that there is a single partition spec on the table. With Iceberg, partitioning can evolve over time, so older partitioning strategies can still be present in the table.
   
   ```sql
   CREATE TABLE prod.my_app.logs (
        env        STRING,
        logline    STRING,
        created_at TIMESTAMP_NTZ -- Nobody likes timezones
   )
   USING iceberg
   PARTITIONED BY (months(created_at));

   -- Later on, when there is more data, switch to daily partitioning
   ALTER TABLE prod.my_app.logs
   REPLACE PARTITION FIELD created_at_month WITH days(created_at);

   -- Or, when we want to split the logs per environment as well
   ALTER TABLE prod.my_app.logs ADD PARTITION FIELD env;
   ```
   
   There is a fair chance that we have to rewrite actual data. When the spec is updated, the table is not rewritten right away, since a full rewrite would be very costly on big tables. So if the existing data is still partitioned monthly and you overwrite it using the new (daily) partitioning, the engine has to read in the Parquet files that match the filter. Let's say that you do `INSERT OVERWRITE PARTITION (created_at='2024-04-08')`: it will read in the files that were written under the monthly partitioning, filter out the data for `2024-04-08`, and write back the remaining data. The new data will then be appended to the table using the new partitioning scheme.
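
   To make the mechanics concrete, here is a rough sketch of what an engine would do for that statement; `read_matching_files`, `write_back`, `append_with_current_spec`, and `new_data` are made-up names for illustration, not the pyiceberg API:

   ```python
   import pyarrow.compute as pc

   # Read in the monthly-partitioned files that may contain 2024-04-08
   existing = read_matching_files()
   # Drop the rows for the overwritten day, keep everything else
   days = pc.strftime(existing["created_at"], format="%Y-%m-%d")
   remaining = existing.filter(pc.not_equal(days, "2024-04-08"))
   write_back(remaining)               # rewrite the surviving rows
   append_with_current_spec(new_data)  # new rows land under the daily spec
   ```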
   
   > Example spark sql counterpart in iceberg spark static overwrite for full 
table overwrite is
   
   I think the top one is a [dynamic overwrite](https://iceberg.apache.org/docs/latest/spark-writes/#dynamic-overwrite), since it does not explicitly call out the partition that it will overwrite.
   
   In that case we should compute all the affected partition values by applying the current partition spec to the dataframe. Based on that, we can first delete the affected partitions and then append the data.
   
   ```python
   def overwrite_dynamic(df: pa.Table) -> None:
       ...
   ```
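
   A rough sketch of how that could work for a table partitioned by `months(created_at)`; `delete_partition` and the table handle `tbl` are made up, and the string formatting only loosely mirrors Iceberg's `months` transform:

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   def overwrite_dynamic(df: pa.Table) -> None:
       # Apply the current spec's transform to the incoming data
       # to find the affected partition values
       months = pc.unique(pc.strftime(df["created_at"], format="%Y-%m"))
       # First delete the affected partitions, then append the new data
       for month in months.to_pylist():
           delete_partition(tbl, f"months(created_at) == '{month}'")  # hypothetical
       tbl.append(df)
   ```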
   
   The second one looks like a static overwrite:
   
   ```python
   def overwrite(df: pa.Table, expr: Union[BooleanExpression, str]) -> None:
       ...
   
   
   overwrite(df, "level = 'INFO'")
   ```
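
   Under the hood that could boil down to a delete followed by an append, ideally committed as a single snapshot; a minimal sketch, where the `delete` call is assumed rather than taken from the current API:

   ```python
   def overwrite(df: pa.Table, expr: Union[BooleanExpression, str]) -> None:
       tbl.delete(expr)  # assumed helper: drop all rows matching the predicate
       tbl.append(df)    # then append the new data
   ```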
   
   I agree that this is a bit odd:
   
   ```
   mismatched input '>' expecting {')', ','}(line 3, pos 22)

   == SQL ==
   INSERT OVERWRITE
       lacus.test.spark_staticoverwrite_partition_clause_and_data_reltship_multicols
       PARTITION (n_legs > 2)
   ----------------------^^^
       SELECT color,animals FROM  tmp_view
   ```
   
   Open questions:
   
   - In Iceberg there is no technical reason to not allow this. Do we want to prevent the user from doing it?
   - Should we add an option that blows up when we have to rewrite files (see the sketch below)? If you get your predicates right, and you don't evolve the partition spec, then no Parquet files should be opened and your overwrites will be crazy efficient.
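
   For the second point, one possible shape for such an option; the flag name is made up:

   ```python
   # Hypothetical strict mode: fail instead of silently rewriting files
   overwrite(df, "created_at >= '2024-04-08'", allow_file_rewrites=False)
   # would raise if the predicate does not line up with the partition boundaries
   ```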
   
   
   

