amabilee commented on issue #11484: URL: https://github.com/apache/iceberg/issues/11484#issuecomment-2462339858
Hey there! The reason `Spark` with `HiveCatalog` doesn't use the existing purge code from `HiveCatalog#dropTable` for its purge operation comes down primarily to **performance** and **storage** considerations.

When you use the `PURGE` option in `Hive`, the underlying data files are deleted immediately instead of being moved to a temporary holding area such as the HDFS trash. That immediate deletion can matter for performance, storage, and security, especially with large datasets or sensitive information.

However, when `Spark SQL` performs a `DROP TABLE` with the `PURGE` clause, it doesn't pass that clause along to the Hive statement that actually drops the table behind the scenes, so the purge behavior isn't applied as expected. To make sure the purge is performed, it's recommended to run the `DROP TABLE` directly in Hive, for example through the Hive CLI, rather than through Spark SQL.

Here is the reference: https://docs.cloudera.com/runtime/latest/developing-spark-applications/topics/spark-sql-drop-table-purge-considerations.html
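
For illustration, here is a minimal sketch of the recommended workaround; the table name `db.sample_table` is a placeholder, and the Spark SQL behavior shown in the comments is as described in the Cloudera documentation linked above:

```sql
-- Recommended: drop the table directly in Hive (CLI/Beeline) so the PURGE
-- clause is honored and data files are deleted immediately instead of being
-- moved to the HDFS trash.
DROP TABLE IF EXISTS db.sample_table PURGE;

-- The same statement parses in Spark SQL, but per the Cloudera note above the
-- PURGE clause may not be forwarded to the Hive drop that runs behind the
-- scenes, so files can land in the trash rather than being removed immediately:
--   spark.sql("DROP TABLE IF EXISTS db.sample_table PURGE")
```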