panbamid-r opened a new issue, #1044: URL: https://github.com/apache/iceberg-python/issues/1044
### Apache Iceberg version

0.7.0 (latest release)

### Please describe the bug 🐞

**TL;DR**: Deleting records based on a condition on a partitioned Iceberg table yields incorrect results and deletes more records than expected.

I encountered the following issue while experimenting with PyIceberg, trying to delete some records based on a condition on a partitioned table. I am using the latest version of PyIceberg (0.7.0) with the Glue catalog.

I have a pandas DataFrame with 100 records, populated as follows:

* `id`: random integer between 1 and 10,000
* `created_at`: random date between `2024-01-01` and `2024-02-01`
* `relevancy_score`: random value between 0 and 1, drawn from a beta distribution with alpha=2, beta=20

Having created the Iceberg table:

```python
schema = Schema(
    NestedField(field_id=101, name="id", field_type=LongType(), required=True),
    NestedField(field_id=103, name="created_at", field_type=DateType(), required=False),
    NestedField(field_id=104, name="relevancy_score", field_type=DoubleType(), required=False),
)

partition_spec = PartitionSpec(
    PartitionField(
        source_id=103, field_id=2000, transform=DayTransform(), name="created_at_day"
    )
)

catalog.create_table(
    identifier=f"{glue_database_name}.{table_name}",
    schema=schema,
    partition_spec=partition_spec,
)
```

I append my DataFrame to the table:

```python
arrow_schema = iceberg_table.schema().as_arrow()
docs_table = pa.Table.from_pandas(df, schema=arrow_schema)

# Append the data to the Iceberg table
iceberg_table.append(docs_table)
```

and I get the following info:

```python
total_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after append: {total_records}")

lower_before = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 before deletion: {lower_before}")

greater_before = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 before deletion: {greater_before}")
```

**Total records after append: (100, 3)**
**Records with relevancy_score < 0.1 before deletion: (64, 3)**
**Records with relevancy_score >= 0.1 before deletion: (36, 3)**

Then I run the deletion:

```python
delete_condition = GreaterThanOrEqual("relevancy_score", 0.1)
iceberg_table.delete(delete_condition)
```

and the results are quite unexpected:

```python
# Verify the total number of records in the table after deletion
remaining_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after deletion: {remaining_records}")

# Verify the number of records that meet the relevancy_score condition after deletion
lower_after = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 after deletion: {lower_after}")

greater_after = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 after deletion: {greater_after}")
```

**Total records after deletion: (26, 3)**
**Records with relevancy_score < 0.1 after deletion: (26, 3)**
**Records with relevancy_score >= 0.1 after deletion: (0, 3)**

Only 26 records remain, even though 64 records satisfied `relevancy_score < 0.1` before the deletion, so 38 records that should have been kept were deleted.

After inspecting the manifest files following the deletion, what seems to be happening is:

* The snapshot correctly keeps the files/days that contain no records with relevancy_score >= 0.1.
* The snapshot correctly marks as deleted the files/days that contain no records with relevancy_score < 0.1.
* The snapshot **seems to incorrectly drop entirely** the files/days that contain **both** records with relevancy_score >= 0.1 **and** records with relevancy_score < 0.1.

For example, consider the following slice of the dataset:

| id   | created_at | relevancy_score     |
|------|------------|---------------------|
| 808  | 2024-01-02 | 0.18115229995825877 |
| 1424 | 2024-01-02 | 0.13995060752152988 |
| 8026 | 2024-01-02 | 0.05992063209461162 |

The `.avro` manifest files after the deletion do not mention any data files of the `2024-01-02` partition whatsoever, when they should have kept the record with id=8026.

I'm attaching an accompanying notebook which I used to recreate the issue. I did not encounter a similar issue when working with a non-partitioned table.
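To make the expected behaviour on the mixed partition concrete, here is a minimal plain-Python sketch (the `rows` list and `delete_where` helper are hypothetical illustrations, not PyIceberg API) of what a predicate delete should do to the `2024-01-02` slice above: rows matching the predicate are removed, and rows that do not match must survive even though they share a partition with matching rows.

```python
from datetime import date

# The 2024-01-02 slice from the table above: (id, created_at, relevancy_score)
rows = [
    (808,  date(2024, 1, 2), 0.18115229995825877),
    (1424, date(2024, 1, 2), 0.13995060752152988),
    (8026, date(2024, 1, 2), 0.05992063209461162),
]

def delete_where(rows, predicate):
    """Expected semantics of a predicate delete: drop every row that matches
    the predicate, and keep every row that does not, regardless of whether
    its partition also contains matching rows."""
    return [row for row in rows if not predicate(row)]

# Mirror the issue's delete condition: relevancy_score >= 0.1
remaining = delete_where(rows, lambda row: row[2] >= 0.1)

# Only id=8026 has relevancy_score < 0.1, so it alone should remain.
print([row[0] for row in remaining])  # → [8026]
```

In other words, for a partition with both matching and non-matching rows, the correct outcome is a rewrite of that partition's data files keeping the non-matching rows, not dropping the partition's files outright as observed above.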