baotran306 opened a new issue, #12738:
URL: https://github.com/apache/iceberg/issues/12738

   ### Apache Iceberg version
   
   1.7.2
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   When I run `CREATE OR REPLACE TABLE ... AS SELECT`, it works well for several cases, including:
   - Replacing a table with its structure only (using `LIMIT 0` in the SELECT statement)
   - Replacing a non-partitioned table with a partitioned table
   - Replacing one identity partition column with another identity partition column
   
   However, one case I tested was unsuccessful: replacing a table that uses a transform partition with a non-partitioned table.
   
   ### Steps to reproduce
   First, I created an empty non-partitioned table using the CREATE TABLE command and inserted some test data. Next, I created another, partitioned table with a different structure, again using CREATE TABLE.
   
   
![Image](https://github.com/user-attachments/assets/fe607428-3a50-4b45-8b8e-8e260b7a1567)
   
   Then, I tried to replace the existing partitioned table with my non-partitioned table using the CREATE OR REPLACE TABLE AS SELECT command. My expectation was that the table would be completely replaced, including both the schema and the data. However, it failed with `org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 24.0 failed 1 times, most recent failure: Lost task 0.0 in stage 24.0 (TID 50) (10.16.14.4 executor driver): org.apache.iceberg.exceptions.ValidationException: Cannot find source column for partition field: 1000: id_bucket: bucket[16](1)`
   
   
![Image](https://github.com/user-attachments/assets/df073e8d-df6a-444d-ab8c-359f659ba361)
   
   Okay, I tried one more time. Before replacing the table, I dropped all partition fields using ALTER TABLE ... DROP PARTITION FIELD. However, it still failed with the same error. I don't understand why it is still checking for partition fields when I had already dropped them before replacing the table. (I have already checked the metadata file, and there is no partition spec now after the drop.)
   
   
![Image](https://github.com/user-attachments/assets/b49cd311-6d5f-4e56-a183-4c81af917ff1)
   
   ### My code: Spark + REST Catalog (catalog_name = 'lakekeeper')
   ```python
   # Create a non-partitioned table
   spark.sql("""
       CREATE TABLE IF NOT EXISTS lakekeeper.risk_mon_out.my_data_non_partitioned(
           vendor_id int,
           product string,
           price double,
           event_time timestamp      
       )
   """)
   spark.sql("""
       INSERT INTO lakekeeper.risk_mon_out.my_data_non_partitioned VALUES
             (1, "Car", 2.34, TIMESTAMP '2025-03-01 11:03:01'),
             (2, "Bike", 5.43, TIMESTAMP '2025-03-01 11:19:21'),
             (3, "House", 100.4, TIMESTAMP '2025-03-01 12:43:17'),
             (4, "Plane", 124.4, TIMESTAMP '2025-03-01 13:05:29')
   """)
   spark.sql("SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned").show()
   
   # Create a partitioned table with a transform
   spark.sql("""
       CREATE TABLE lakekeeper.risk_mon_out.data_sample_test(
           id int,
           category string,
           event_time timestamp      
       )
       PARTITIONED BY (
           bucket(16, id),
           category
       )
   """)
   spark.sql("""
       INSERT INTO lakekeeper.risk_mon_out.data_sample_test VALUES
             (5, "HOVal", TIMESTAMP '2025-03-01 11:03:09'),
             (6, "NextMi", TIMESTAMP '2025-03-01 12:19:29'),
             (7, "OGel", TIMESTAMP '2025-01-09 06:43:17'),
             (8, "AAAA", TIMESTAMP '2025-01-09 06:05:29')
   """)
   spark.sql("SELECT * FROM lakekeeper.risk_mon_out.data_sample_test").show()
   
   # Replace the partitioned table with the non-partitioned one
   try:
       spark.sql("""
           CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.data_sample_test
           USING iceberg
           AS SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned
       """)
   except Exception as e:
       print(f"Error 111: {e}")
   
   try:
       spark.sql("ALTER TABLE lakekeeper.risk_mon_out.data_sample_test DROP PARTITION FIELD id_bucket")
       spark.sql("ALTER TABLE lakekeeper.risk_mon_out.data_sample_test DROP PARTITION FIELD category")
       print("Drop done")
       spark.sql("""
           CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.data_sample_test
           USING iceberg
           AS SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned
       """)
   except Exception as e:
       print("=====")
       print(f"Error 222: {e}")
   ```
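   
   For now, a possible workaround is to drop and recreate the table instead of using CREATE OR REPLACE. This is only a sketch (and, unlike REPLACE, it discards the table's snapshot history, so it is not an equivalent operation):
   
   ```sql
   -- Workaround sketch: drop the table (PURGE also removes the data files),
   -- then recreate it from the non-partitioned source
   DROP TABLE lakekeeper.risk_mon_out.data_sample_test PURGE;
   
   CREATE TABLE lakekeeper.risk_mon_out.data_sample_test
   USING iceberg
   AS SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned;
   ```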
   
   ### Expected Behavior
   The table should be replaced entirely, including both schema and data.
   
   ### Environment
   Apache Iceberg Version: 1.7.2
   Execution Engines Tested: Spark 3.5.2
   Catalog Type: REST catalog
   Storage: S3-compatible
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

