panbamid-r opened a new issue, #1044:
URL: https://github.com/apache/iceberg-python/issues/1044

   ### Apache Iceberg version
   
   0.7.0 (latest release)
   
   ### Please describe the bug 🐞
   
   **TLDR**: Deleting records based on a condition on a partitioned Iceberg table yields incorrect results and deletes more records than expected.
   
   I encountered the following issue while experimenting with pyiceberg, trying to delete some records from a partitioned table based on a condition. I am using the latest release of pyiceberg (0.7.0) with the Glue catalog.
   
   I have a pandas dataframe with 100 records, populated as follows:
   * id: random integer between 1 and 10,000
   * created_at: random date between `2024-01-01` and `2024-02-01`
   * relevancy_score: random float between 0 and 1, drawn from a beta distribution with alpha=2, beta=20
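A minimal sketch of how such a dataframe could be generated (hypothetical; the seed and exact generation code are mine, not from the notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
n = 100

df = pd.DataFrame({
    # ids drawn uniformly between 1 and 10,000
    "id": rng.integers(1, 10_001, size=n),
    # dates drawn uniformly between 2024-01-01 and 2024-02-01
    "created_at": pd.to_datetime("2024-01-01")
    + pd.to_timedelta(rng.integers(0, 32, size=n), unit="D"),
    # relevancy scores from a Beta(2, 20) distribution, values in (0, 1)
    "relevancy_score": rng.beta(2, 20, size=n),
})
```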
   
   Having created the Iceberg table:
   
   ```
   from pyiceberg.partitioning import PartitionField, PartitionSpec
   from pyiceberg.schema import Schema
   from pyiceberg.transforms import DayTransform
   from pyiceberg.types import DateType, DoubleType, LongType, NestedField

   schema = Schema(
       NestedField(field_id=101, name="id", field_type=LongType(), required=True),
       NestedField(field_id=103, name="created_at", field_type=DateType(), required=False),
       NestedField(field_id=104, name="relevancy_score", field_type=DoubleType(), required=False),
   )

   partition_spec = PartitionSpec(
       PartitionField(
           source_id=103, field_id=2000, transform=DayTransform(), name="created_at_day"
       )
   )

   iceberg_table = catalog.create_table(
       identifier=f"{glue_database_name}.{table_name}",
       schema=schema,
       partition_spec=partition_spec,
   )
   ```
   
   I append my df to the table:
   
   ```
   arrow_schema = iceberg_table.schema().as_arrow()
   docs_table = pa.Table.from_pandas(df, schema=arrow_schema)
   
   # Append the data to the Iceberg table
   iceberg_table.append(docs_table)
   ```
   
   and I get the following info:
   
   ```
   from pyiceberg.expressions import GreaterThanOrEqual, LessThan

   total_records = iceberg_table.scan().to_pandas().shape
   print(f"Total records after append: {total_records}")

   lower_before = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
   print(f"Records with relevancy_score < 0.1 before deletion: {lower_before}")

   greater_before = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
   print(f"Records with relevancy_score >= 0.1 before deletion: {greater_before}")
   ```
   
   **Total records after append: (100, 3)
   Records with relevancy_score < 0.1 before deletion: (64, 3)
   Records with relevancy_score >= 0.1 before deletion: (36, 3)**
   
   I then run the deletion:
   
   ```
   delete_condition = GreaterThanOrEqual("relevancy_score", 0.1)
   iceberg_table.delete(delete_condition)
   ```
   
   and the results are quite unexpected:
   
   ```
   # Verify the total number of records in the table after deletion
   remaining_records = iceberg_table.scan().to_pandas().shape
   print(f"Total records after deletion: {remaining_records}")

   # Verify the number of records that meet the relevancy_score condition after deletion
   lower_after = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
   print(f"Records with relevancy_score < 0.1 after deletion: {lower_after}")

   greater_after = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
   print(f"Records with relevancy_score >= 0.1 after deletion: {greater_after}")
   ```
   
   **Total records after deletion: (26, 3)
   Records with relevancy_score < 0.1 after deletion: (26, 3)
   Records with relevancy_score >= 0.1 after deletion: (0, 3)**
   
   After inspecting the manifest files post-deletion, what seems to be happening is:
   * The snapshot correctly keeps the files/days in which no records have relevancy_score >= 0.1
   * The snapshot correctly marks as deleted the files/days in which no records have relevancy_score < 0.1
   * The snapshot **seems to incorrectly drop entirely** the files/days that contain both records with relevancy_score >= 0.1 **and** records with relevancy_score < 0.1
   
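In other words, the behaviour is consistent with per-file handling that gets the two "all rows match" cases right but mishandles mixed files. A hypothetical sketch of the expected logic (names are illustrative, not pyiceberg internals):

```python
# Hypothetical per-data-file decision for DELETE WHERE relevancy_score >= threshold.
# These names are illustrative only; they are not pyiceberg's actual internals.
def classify_file(scores, threshold=0.1):
    matches = [s >= threshold for s in scores]
    if all(matches):
        return "drop"     # every row matches the predicate: drop the whole file
    if not any(matches):
        return "keep"     # no row matches: keep the file untouched
    return "rewrite"      # mixed file: must be rewritten, keeping non-matching rows

print(classify_file([0.2, 0.3]))           # all rows match -> "drop"
print(classify_file([0.05, 0.01]))         # no row matches -> "keep"
print(classify_file([0.18, 0.14, 0.06]))   # mixed, like 2024-01-02 -> "rewrite"
```

The symptom above suggests the mixed ("rewrite") case is being treated like the "drop" case.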
   For example, let's consider the following slice of the dataset:
   
   | id   | created_at | relevancy_score     |
   |------|------------|---------------------|
   | 808  | 2024-01-02 | 0.18115229995825877 |
   | 1424 | 2024-01-02 | 0.13995060752152988 |
   | 8026 | 2024-01-02 | 0.05992063209461162 |
   
   The manifest `.avro` files written after the deletion do not reference any data files of the `2024-01-02` partition at all, even though the record with id=8026 should have been kept.
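Applying the delete predicate by hand on this slice shows what should survive (a plain-Python check on the sample values above, not the actual table):

```python
# The three sample rows from the 2024-01-02 partition: (id, created_at, relevancy_score)
rows = [
    (808, "2024-01-02", 0.18115229995825877),
    (1424, "2024-01-02", 0.13995060752152988),
    (8026, "2024-01-02", 0.05992063209461162),
]

# DELETE WHERE relevancy_score >= 0.1 should keep only the rows below the threshold
survivors = [row for row in rows if row[2] < 0.1]
print(survivors)  # only id=8026 remains
```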
   
   I'm attaching an accompanying notebook which I used to reproduce the issue.
   I did not encounter a similar issue when working with a non-partitioned table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

