thomas-pfeiffer opened a new issue, #14735:
URL: https://github.com/apache/iceberg/issues/14735

   ### Apache Iceberg version
   
   1.9.2
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   **Background:**
   We have an Iceberg table that we append to with high concurrency via AWS 
Lambda. We use AWS Glue 5.0 with PySpark and Iceberg 1.9.2 to run maintenance 
and housekeeping tasks on a regular basis. Our Glue script mainly triggers the 
following Spark stored procedures:
   - delete duplicate entries (custom Spark SQL)
   - `glue_catalog.system.rewrite_position_delete_files`
   - `glue_catalog.system.rewrite_data_files`
   - `glue_catalog.system.expire_snapshots`
   - `glue_catalog.system.remove_orphan_files`
   - `glue_catalog.system.compute_table_stats`
   
   **Issue (observed behaviour):**
   When there are a lot of concurrent writes, the `compute_table_stats` 
procedure fails with a CommitFailedException:
   ```
   org.apache.iceberg.exceptions.CommitFailedException: Cannot commit 
glue_catalog.{database name}.{table_name} because base metadata location 
's3://{s3_bucket_name}/{database 
name}.db/{table_name}/metadata/07445-f1e79608-66bf-4d0a-a771-1a980f5a381a.metadata.json'
 is not same as the current Glue location 's3://{s3_bucket_name}/{database 
name}.db/{table_name}/metadata/07449-84c55277-f396-47c7-933d-9829c57ca0f3.metadata.json'
   ```
   The Glue script, including `compute_table_stats`, finishes successfully when 
there are no concurrent writes.
   
   **Expected behaviour:**
   Naively, we would expect the `compute_table_stats` procedure not to fail 
when concurrent writes are happening. I guess it should handle the exception 
internally and finish regardless (though potentially more slowly because of 
some retries).
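
   To illustrate the behaviour we have in mind, here is a minimal sketch of a 
caller-side retry around the procedure call. It is not what Iceberg does 
internally. The helper names (`call_with_retry`, `is_commit_conflict`) and the 
backoff values are hypothetical, and it assumes that PySpark includes the Java 
exception class name in the text of the Python exception it raises.

```python
import time


def call_with_retry(run, is_retryable, attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call run(); on a retryable error, back off exponentially and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except Exception as exc:
            # Give up on the last attempt or on errors we do not recognise.
            if attempt == attempts or not is_retryable(exc):
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))


def is_commit_conflict(exc):
    # Assumption: the Java class name appears in the Python exception text.
    return "CommitFailedException" in str(exc)


def compute_table_stats(spark, table, **retry_options):
    # `table` is "<database>.<table>"; `glue_catalog` is our catalog name.
    sql = f"CALL glue_catalog.system.compute_table_stats(table => '{table}')"
    return call_with_retry(lambda: spark.sql(sql), is_commit_conflict, **retry_options)
```

   In a Glue job this would be called as `compute_table_stats(spark, 
"<database>.<table>")`. Note that every retry recomputes the statistics from 
scratch, so this is a workaround rather than the commit-level retry we would 
expect from the procedure itself.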
   
   **Remarks:**
   - It's a bit confusing to us that only the `compute_table_stats` procedure 
has this issue. The other procedures ran through (seemingly) regardless of 
concurrent writes.
   - We currently specify only the `table` parameter. We do not set 
`snapshot_id`, and `columns` is also unset, since we want statistics on all 
columns.
   - We set the table properties for retries, but we're not sure whether the 
properties are leveraged by the procedure.
   - In case you need additional information or if I missed any crucial detail, 
please let me know.
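
   On the retry properties mentioned above: assuming they are Iceberg's 
standard `commit.retry.*` table properties, the sketch below shows how such 
properties can be set from PySpark. The helper name `retry_properties_sql` is 
hypothetical and the values are purely illustrative, not the ones used on the 
affected table.

```python
def retry_properties_sql(table, num_retries=10, min_wait_ms=1000, max_wait_ms=60000):
    """Build an ALTER TABLE statement that sets Iceberg commit retry properties."""
    # Illustrative values only; the property names are Iceberg table properties.
    props = {
        "commit.retry.num-retries": num_retries,
        "commit.retry.min-wait-ms": min_wait_ms,
        "commit.retry.max-wait-ms": max_wait_ms,
    }
    assignments = ", ".join(f"'{key}'='{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({assignments})"
```

   The statement would then be run with 
`spark.sql(retry_properties_sql("glue_catalog.<database>.<table>"))`.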
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

