paultipper opened a new issue, #11953:
URL: https://github.com/apache/iceberg/issues/11953

   ### Apache Iceberg version
   
   1.6.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I'm trying to use the Apache Spark MERGE INTO command to add/update some 
data from a source data frame into an Apache Iceberg table within an AWS Glue 
table using an AWS Glue job running Spark 3.5. If the source data frame is 
empty, then all of the existing data in the target table is deleted.
   
   Here is a sample of the Python code I'm using to do this:
   
   ```
   # df is a data frame of the source data, and is passed into this code block
   df.createOrReplaceTempView("source_data")
   
   # Get start year, month and day from start_date, which is a datetime object
   # passed into this code block
   year = start_date.year
   month = start_date.month
   day = start_date.day
   print(f"start_date: {start_date}, year: {year}, month: {month}, day: {day}")
   # Generate the WHERE part of the statement
   where_clause = (
       f"WHERE year >= {year} AND (year > {year} OR month >= {month}) "
       f"AND (year > {year} OR month > {month} OR day >= {day})"
   )
   
   selected_df = spark.sql(f"SELECT * FROM source_data {where_clause}")
   logger.info(f"New CSV rows selected for merging: {selected_df.count()}")
   selected_df.createOrReplaceTempView("new_data")
   
   
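   # Run the MERGE statement (wrapped in spark.sql() here so the sample is runnable)
   spark.sql("""
       MERGE INTO iceberg_catalog.db.target_table t
       USING new_data AS s
           ON (t.surrogate_key = s.surrogate_key)
       WHEN MATCHED THEN
           UPDATE SET *
       WHEN NOT MATCHED THEN
           INSERT *
   """)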
   ```
   
   Before the MERGE INTO operation, the target table contained 8246 rows, and 
I've established that the `selected_df` data frame contained 0 rows. I expected 
that merging `selected_df` into the target table would leave the target table 
unchanged, but after the MERGE INTO operation the target table was empty. My 
assumption is that the MERGE INTO command inserts any rows in `selected_df` 
that do not already exist in the target table, updates any rows that do exist, 
and leaves in place any rows in the target table that are not in 
`selected_df`. Is that assumption incorrect?
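
To make that expectation concrete, here is a minimal pure-Python sketch of the semantics I am assuming. It is only a model, not how Spark or Iceberg execute the statement: `merge_upsert` and the sample rows are hypothetical names made up for illustration, and it assumes `surrogate_key` is unique within the target.

```python
def merge_upsert(target, source, key="surrogate_key"):
    """Model of MERGE INTO with WHEN MATCHED UPDATE SET * / WHEN NOT MATCHED INSERT *.

    Hypothetical helper for illustration only; rows are plain dicts and
    `key` is assumed to be unique within `target`.
    """
    # Start from every existing target row, indexed by the join key.
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: the source row replaces the target row.
        # Unmatched key: the source row is inserted.
        merged[row[key]] = dict(row)
    # Target rows with no match in the source are left untouched.
    return list(merged.values())


target_rows = [
    {"surrogate_key": 1, "value": "a"},
    {"surrogate_key": 2, "value": "b"},
]
source_rows = [
    {"surrogate_key": 2, "value": "B"},
    {"surrogate_key": 3, "value": "c"},
]

# Empty source: under this model the merge is a no-op.
unchanged = merge_upsert(target_rows, [])

# Non-empty source: key 2 is updated, key 3 is inserted, key 1 is kept.
result = merge_upsert(target_rows, source_rows)
```

Under this model, an empty source returns the target rows unchanged, which is the behaviour I expected from the real MERGE INTO and not what I observed.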
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [X] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

