chinnaraolalam opened a new issue, #10157: URL: https://github.com/apache/iceberg/issues/10157
### Apache Iceberg version

1.4.3

### Query engine

Spark

### Please describe the bug 🐞

Drop table purge issue for parquet tables with **SparkSessionCatalog**. This was identified in our environment with Iceberg 1.4.3 + Spark 3.4.1 + SPARK-43203 (applied as a patch).

**CASE 1**: Launch a spark-sql session with the default **SessionCatalog**. Dropping a non-Iceberg table such as a parquet table purges the data and leaves nothing on disk.

**CASE 2**: Launch a spark-sql session with **SparkSessionCatalog**. Dropping a non-Iceberg table such as a parquet table does not purge the data, and the data is left behind until it is cleaned up manually. Creating a table with the same name then fails.

The behaviour of parquet tables differs between CASE 1 and CASE 2; launching spark-sql with **SparkSessionCatalog** changes the behaviour of non-Iceberg tables, which is the issue.

**Tested in cluster**

CASE 1: Spark session launched with the default Spark catalog

```sql
CREATE TABLE parquettable (id bigint, data string) USING parquet;
INSERT INTO parquettable VALUES (1, 'A'), (2, 'B'), (3, 'C');
SELECT id, data FROM parquettable WHERE length(data) = 1;
DROP TABLE parquettable;
CREATE TABLE parquettable (id bigint, data string) USING parquet; -- this query SUCCEEDS
```

CASE 2: Spark session launched with the Iceberg SparkSessionCatalog (`--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog`)

```sql
CREATE TABLE parquettable (id bigint, data string) USING parquet;
INSERT INTO parquettable VALUES (1, 'A'), (2, 'B'), (3, 'C');
SELECT id, data FROM parquettable WHERE length(data) = 1;
DROP TABLE parquettable;
CREATE TABLE parquettable (id bigint, data string) USING parquet; -- this query fails with [LOCATION_ALREADY_EXISTS] because the drop did not purge the data
```

The same scenario behaves differently in the two sessions. A session launched with the Iceberg SparkSessionCatalog effectively forces purge-off semantics when dropping non-Iceberg tables. IIUC, SparkSessionCatalog is meant to work for both Iceberg and non-Iceberg tables, so for non-Iceberg tables it should fall back to the default Spark catalog and behave the same as Spark. In the case above, however, creating a parquet table with the same name fails because the purge did not happen (this is not Spark's behaviour). Only Iceberg tables are supposed to have purge off by default.

On further analysis, my suspicion is that `SparkSessionCatalog.dropTable(Identifier ident)` drops the table via `icebergCatalog.dropTable(ident)`, where purge is off, and returns from there, so purge-off is what gets passed on to Spark (I guess this is related to https://issues.apache.org/jira/browse/SPARK-43203, which was done in 3.4.2).

To fix this, update `SparkSessionCatalog.dropTable(Identifier ident)` as below:

```java
public boolean dropTable(Identifier ident) {
  if (icebergCatalog.tableExists(ident)) {
    return icebergCatalog.dropTable(ident);
  } else {
    return getSessionCatalog().dropTable(ident);
  }
}
```

To reproduce this issue on the main branch (Spark 3.5 is the default), I reverted the purge in the tests added as part of https://github.com/apache/iceberg/pull/9187. Multiple tests fail, and after updating `dropTable` as above all tests pass. I created a patch for the same (this is to demonstrate the issue, not the final fix).

Still need to check:
1. What is the behaviour of Iceberg tables now.
2. What about the other APIs of SparkSessionCatalog (see the sketch below).
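On point 2, here is a minimal sketch of how the same existence-check fallback could be applied to `purgeTable`, which `TableCatalog` also exposes. It reuses the `icebergCatalog`, `getSessionCatalog()`, and `tableExists` names from the `dropTable` snippet above and is only an illustration of the idea, not a verified change to SparkSessionCatalog:

```java
// Hypothetical sketch only: mirror the proposed dropTable fallback for purgeTable,
// so that non-Iceberg tables keep the built-in session catalog's purge behaviour.
@Override
public boolean purgeTable(Identifier ident) {
  if (icebergCatalog.tableExists(ident)) {
    // Iceberg tables keep Iceberg's own drop/purge semantics
    return icebergCatalog.purgeTable(ident);
  } else {
    // non-Iceberg tables fall back to the Spark session catalog
    return getSessionCatalog().purgeTable(ident);
  }
}
```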