kaushikranjan opened a new issue, #12588: URL: https://github.com/apache/iceberg/issues/12588
### Query engine

Spark

### Question

I have a Spark Streaming job that writes data from a source table to a destination table using `MERGE INTO`:

```sql
MERGE INTO nessie.local.dst dst
USING nessie.local.src src
ON dst.id = src.id AND dst.employer = src.employer
WHEN MATCHED THEN
  UPDATE SET dst.year = src.year
WHEN NOT MATCHED THEN
  INSERT (id, year, employer, created_on, updated_on)
  VALUES (src.id, src.year, src.employer, src.created_on, src.updated_on)
```

When the number of tasks required to read the destination table goes beyond a threshold, I stop the streaming process and run compaction (to prevent any data corruption). The destination table uses "merge-on-read" for all write modes.

After compaction, reading the destination table causes Spark to error out, with executors exiting with code 134. FYI: the total record count is under 30,000 and the total file size is ~4 MB. I am running 3 executors with 4 cores and 8 GB memory each, plus a driver with 4 cores and 8 GB memory. Is this a data-corruption issue?

The destination table is partitioned on the `employer` column and sorted on `id`. Do we need to specify `strategy => 'sort'` when running `rewrite_data_files`, or will the procedure automatically pick the sort strategy, given that the table already has a sort order defined?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
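For reference, the compaction calls in question would look roughly like the sketch below. This assumes the `nessie` catalog name from the question and Iceberg's documented `rewrite_data_files` procedure; with `strategy => 'sort'` and no explicit `sort_order`, the procedure falls back to the table's own sort order. The `rewrite_position_delete_files` call is a separate procedure (available in newer Iceberg releases) that is often relevant for merge-on-read tables, since positional delete files accumulate alongside data files.

```sql
-- Rewrite data files using the table's existing sort order
-- (strategy => 'sort' with no sort_order argument uses the
-- sort order already defined on the table).
CALL nessie.system.rewrite_data_files(
  table => 'local.dst',
  strategy => 'sort'
);

-- Merge-on-read tables also accumulate position delete files;
-- these are compacted by a separate procedure.
CALL nessie.system.rewrite_position_delete_files(
  table => 'local.dst'
);
```

Without `strategy => 'sort'`, `rewrite_data_files` defaults to the binpack strategy, which packs small files together but does not re-sort rows, so data written out of order by the streaming job would stay unsorted within the rewritten files.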