kaushikranjan commented on issue #12704: URL: https://github.com/apache/iceberg/issues/12704#issuecomment-2781504606
We are using AWS spot resources: 3 executors (8g, 4 cores each) and a driver with 8g and 4 cores. I believe this configuration is more than enough for the size of data we are looking at.

@RussellSpitzer We overwrite the data in our Iceberg table (using MERGE INTO) on every micro-batch. We estimate the number of records to be around 1.6 million. The initial runs consist of INSERTs, which causes the number of data files per partition to grow. To control the number of files, we are planning to run rewrite_data_files.

Schema:
id: string, customer_id: string, ... other fields ...

Our tables are partitioned on customer_id and sorted on id. We are running rewrite_data_files with the binpack strategy. If my understanding is correct, this will sort the data within each partition.

Ironically, the same strategy works for us on a different pipeline with a data set of around 3 million records, but it fails on this one. I initially thought it was a data issue and tried running it on fake data as well, but it still does not work: the executors start to fail during binpack compaction.

```
df = spark.sql(f"""
    CALL nessie.system.rewrite_data_files(
        table => '{table_name}',
        options => map(
            'min-file-size-bytes', '{minFileSizeBytes}',
            'max-file-size-bytes', '{maxFileSizeBytes}',
            'target-file-size-bytes', '{targetFileSizeBytes}',
            'min-input-files', '{minInputFiles}'
        )
    )
""")
```

What are we doing wrong? I am very sure it is a configuration issue!

---

@RussellSpitzer if our data is sorted on a field, should we be running the binpack strategy, or should we go for sort-order? We only want the data sorted within our partitions.
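To make the binpack-versus-sort question concrete, here is a minimal sketch of the alternative call. It assumes the `strategy` and `sort_order` arguments as described in the Iceberg Spark procedures documentation, where `sort` appears to be the strategy that applies a sort order during the rewrite. The helper `build_rewrite_call`, the table name `db.events`, and the option values are hypothetical placeholders, not our real settings.

```python
def build_rewrite_call(catalog, table_name, strategy="binpack",
                       sort_order=None, options=None):
    """Build a CALL statement for rewrite_data_files (hypothetical helper)."""
    args = [f"table => '{table_name}'", f"strategy => '{strategy}'"]
    if sort_order is not None:
        # Assumption: sort_order is only meaningful with strategy => 'sort'.
        args.append(f"sort_order => '{sort_order}'")
    if options:
        # Joining the pairs avoids a trailing comma inside map(...).
        pairs = ", ".join(f"'{k}', '{v}'" for k, v in options.items())
        args.append(f"options => map({pairs})")
    return f"CALL {catalog}.system.rewrite_data_files(" + ", ".join(args) + ")"


# Placeholder table name and option values, for illustration only.
sql = build_rewrite_call(
    "nessie",
    "db.events",
    strategy="sort",
    sort_order="id ASC NULLS LAST",
    options={"min-input-files": 5, "target-file-size-bytes": 134217728},
)
print(sql)
# With a live session this would be run as: spark.sql(sql).show()
```

Building the statement as a plain string keeps it easy to log and compare between the pipeline that works and the one that fails.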