thomas-pfeiffer commented on issue #14735:
URL: https://github.com/apache/iceberg/issues/14735#issuecomment-3626712226

   Hi @hemanthboyina,
   for both `rewrite_position_delete_files` and `rewrite_data_files` we enabled 
the `partial-progress.enabled` option for this reason. The documentation lists 
no such option for `compute_table_stats`. Is `partial-progress.enabled` 
implemented for `compute_table_stats` as well?
   
   > Did you observed any exceptions for these procedures in your logs and 
still these procedures are succeeded ?
   
   Yes, I see some warnings like `org.apache.iceberg.util.Tasks 
[Rewrite-Service-4]     459     Retrying task after failure: sleepTimeMs=210 
Cannot commit glue_catalog.{database name}.{table_name} because base metadata 
location 's3://{s3_bucket_name}/{database 
name}.db/{table_name}/metadata/07304-cbdd94bd-d8ad-42c0-b74d-7547629e61c8.metadata.json'
 is not same as the current Glue location 's3://{s3_bucket_name}/{database 
name}.db/{table_name}/metadata/07305-f300df1c-9681-42b8-b7cb-f64251b2ba4b.metadata.json'`.
 After a couple of retries, however, the commits succeeded.
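
As far as I understand, this warning is emitted by Iceberg's optimistic commit retry loop, which is controlled by the table properties `commit.retry.num-retries`, `commit.retry.min-wait-ms` and `commit.retry.max-wait-ms`. Below is a minimal sketch of how those could be raised if the retries ever ran out. The helper name `retry_properties_sql` and the values shown are hypothetical and not part of our script:

```py
def retry_properties_sql(
    table: str,
    num_retries: int = 10,
    min_wait_ms: int = 200,
    max_wait_ms: int = 60000,
) -> str:
    """Build an ALTER TABLE statement that sets Iceberg commit retry properties.

    The default values here are illustrative assumptions, not recommendations.
    """
    props = {
        "commit.retry.num-retries": num_retries,
        "commit.retry.min-wait-ms": min_wait_ms,
        "commit.retry.max-wait-ms": max_wait_ms,
    }
    # Table property values are passed as quoted strings.
    rendered = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({rendered})"
```

The resulting string would be passed to `spark.sql(...)` once per table, before the maintenance procedures run.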
   
   > Also can you please share the command you executed ? 
   
   Here is a condensed (pseudo-code) version of the script:
   ```py
   def maintain_table(spark: SparkSession, logger, database: str, table: str, ...) -> None:
       logger.info(f"Maintaining Iceberg table '{database}.{table}'...")
       try:
           query: str = f"""
               DELETE FROM glue_catalog.{database}.{table} as target
               WHERE EXISTS (
                   SELECT 1 FROM (
                       SELECT ..., ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC NULLS LAST) AS row_number
                       FROM glue_catalog.{database}.{table}
                   ) WHERE row_number != 1 AND target... = ...
               )"""
           spark.sql(query)...
           logger.info(f"Deleted duplicates in Iceberg table '{database}.{table}': ...")
   
           query: str = f"CALL glue_catalog.system.rewrite_position_delete_files(table => 'glue_catalog.{database}.{table}', options => map('partial-progress.enabled', true))"
           spark.sql(query)...
           logger.info(f"Position delete files rewritten for Iceberg table '{database}.{table}': ...")
   
           query: str = f"CALL glue_catalog.system.rewrite_data_files(table => 'glue_catalog.{database}.{table}', strategy => 'sort', options => map('partial-progress.enabled', 'true', 'min-input-files', '5', 'delete-file-threshold', '2', 'remove-dangling-deletes', 'true'))"
           spark.sql(query)...
           logger.info(f"Data files rewritten for Iceberg table '{database}.{table}': ...")
   
           query: str = f"CALL glue_catalog.system.expire_snapshots(table => 'glue_catalog.{database}.{table}', stream_results => true)"
           spark.sql(query)...
           logger.info(f"Snapshots expired for Iceberg table '{database}.{table}': ...")
   
           query: str = f"CALL glue_catalog.system.remove_orphan_files(table => 'glue_catalog.{database}.{table}')"
           spark.sql(query)...
           logger.info(f"Orphan files removed for Iceberg table '{database}.{table}': ...")
   
           query: str = f"CALL glue_catalog.system.compute_table_stats(table => 'glue_catalog.{database}.{table}')"
           spark.sql(query)...
           logger.info(f"Calculate table statistics for Iceberg table '{database}.{table}': ...")
   
           logger.info(f"Maintenance for Iceberg table '{database}.{table}' is done.")
       except Exception as err:
           logger.error(
               f"Exception while maintaining Iceberg table '{database}.{table}': {err}"
           )
           raise err
   ```
   



