amabilee commented on issue #11484:
URL: https://github.com/apache/iceberg/issues/11484#issuecomment-2462339858

   Hey there!
   
   `Spark` with `HiveCatalog` doesn't use the existing purge code from 
`HiveCatalog#dropTable` for its purge operation primarily because of 
**performance** and **storage** considerations.
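   For context, the purge behavior is exposed directly on the Iceberg `Catalog` API: calling `dropTable(identifier, purge)` with `purge = true` asks `HiveCatalog` to remove the table's data and metadata files along with the metastore entry. Here's a minimal sketch; the metastore URI, warehouse path, and table name are placeholders:

```java
import java.util.Map;

import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

public class HiveCatalogPurgeExample {
    public static void main(String[] args) {
        // Placeholder metastore URI and warehouse path; adjust for your environment.
        HiveCatalog catalog = new HiveCatalog();
        catalog.initialize("hive", Map.of(
            "uri", "thrift://metastore-host:9083",
            "warehouse", "hdfs://namenode:8020/warehouse"));

        // The second argument is the purge flag: when true, the catalog deletes the
        // table's data and metadata files in addition to the metastore entry.
        catalog.dropTable(TableIdentifier.of("db", "example_table"), true);
    }
}
```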
   
   When you use the `PURGE` option in `Hive`, it immediately deletes the 
underlying data files instead of moving them to a temporary holding area like the 
HDFS trash. This can be crucial for performance, storage, and security 
reasons, especially when dealing with large datasets or sensitive information.
   
   However, when `Spark SQL` performs a `DROP TABLE` operation with the `PURGE` 
clause, it doesn't pass this clause along to the underlying Hive `DROP TABLE` 
statement that actually drops the table, so the purge behavior isn't applied as 
expected.
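   To illustrate (assuming a Hive-managed table named `db.example_table`), this is roughly what the Spark SQL path looks like from a Java application; the statement parses fine, but the `PURGE` semantics may not reach Hive:

```java
import org.apache.spark.sql.SparkSession;

public class SparkDropTablePurgeExample {
    public static void main(String[] args) {
        // Typically submitted via spark-submit, which supplies the master and Hive config.
        SparkSession spark = SparkSession.builder()
            .appName("drop-table-purge")
            .enableHiveSupport()
            .getOrCreate();

        // Spark SQL accepts the PURGE clause, but for a Hive-managed table the clause
        // may not be forwarded to the underlying Hive DROP TABLE, so the files can
        // still end up in the HDFS trash rather than being deleted immediately.
        spark.sql("DROP TABLE IF EXISTS db.example_table PURGE");

        spark.stop();
    }
}
```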
   
   To ensure the purge operation is performed correctly, it's recommended to 
execute the `DROP TABLE` operation directly in Hive, for example, through the 
Hive CLI (command-line interface), rather than through Spark SQL.
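   If the Hive CLI isn't convenient, an alternative (not from the docs above, just a sketch) is to send the same statement straight to HiveServer2 over the Hive JDBC driver; the endpoint and table name below are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveDirectDropPurgeExample {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 endpoint; adjust host, port, and credentials.
        String url = "jdbc:hive2://hiveserver2-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Issued directly against Hive, the PURGE clause is honored and the data
            // files are deleted immediately instead of being moved to the HDFS trash.
            stmt.execute("DROP TABLE IF EXISTS db.example_table PURGE");
        }
    }
}
```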
   
   Here is the reference: 
https://docs.cloudera.com/runtime/latest/developing-spark-applications/topics/spark-sql-drop-table-purge-considerations.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
