kaushikranjan commented on issue #12704:
URL: https://github.com/apache/iceberg/issues/12704#issuecomment-2781504606

   We are using AWS spot instances.
   3 executors, each with 8g memory and 4 cores.
   Driver: 8g memory, 4 cores.
   
   I believe this configuration is more than enough for the size of data we are 
looking at.
   
   @RussellSpitzer 
   
   We overwrite the data in our Iceberg table (using MERGE INTO) on every 
micro-batch. 
   We estimate the number of records to be around 1.6 million. The initial runs 
are mostly INSERTs, which increases the number of data files per partition. To 
control the number of files, we are planning to run rewrite_data_files. 
   
   Schema:
   id: string,
   customer_id: string,
   ... other fields ...
   
   Our tables are partitioned on customer_id and sorted on id. 
   We run rewrite_data_files with the binpack strategy. If my understanding 
is correct, this will sort the data within each partition.
   
   --
   Strangely, the same strategy works for us on a different pipeline with 
another data set of around 3 million records, but it fails on this one.
   I initially thought it was a data issue, so I also tried running it on fake 
data. It still doesn't work: the executors start to fail during binpack 
compaction.
   
   ```
   df = spark.sql(f"""
       CALL nessie.system.rewrite_data_files(
           table => '{table_name}',
           options => map(
               'min-file-size-bytes', '{minFileSizeBytes}',
               'max-file-size-bytes', '{maxFileSizeBytes}',
               'target-file-size-bytes', '{targetFileSizeBytes}',
               'min-input-files', '{minInputFiles}'
           )
       )
   """)
   ```
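
For anyone trying to reproduce this, here is a rough sketch of how the CALL statement could be assembled from a dict, so the options map cannot end with a dangling comma. The helper name `build_rewrite_call`, the table name `db.events`, and the option values are illustrative assumptions, not the values our job uses.

```python
def build_rewrite_call(catalog, table_name, options):
    """Build a rewrite_data_files CALL statement from a dict of options."""
    # Each option becomes a "'key', 'value'" pair; join() puts commas only
    # between pairs, so the last entry has no trailing comma.
    pairs = ",\n        ".join(f"'{k}', '{v}'" for k, v in options.items())
    return (
        f"CALL {catalog}.system.rewrite_data_files(\n"
        f"    table => '{table_name}',\n"
        f"    options => map(\n        {pairs}\n    )\n"
        ")"
    )


# Illustrative values only (assumptions, not our production settings).
example_options = {
    "target-file-size-bytes": "536870912",
    "min-input-files": "5",
}
sql = build_rewrite_call("nessie", "db.events", example_options)
print(sql)
# With a live Spark session this would be passed on as: spark.sql(sql)
```

Printing the statement before handing it to `spark.sql` also makes it easy to check exactly what is being sent to the procedure.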
   
   What are we doing wrong? I am very sure it is a configuration issue!
   
   ---
   
   @RussellSpitzer if our data is sorted on a field, should we be running the 
binpack strategy or should we go for the sort strategy?
   We only want the data sorted within our partitions.
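
For comparison, this is roughly what the sort variant would look like, going by the `strategy` and `sort_order` arguments in the Iceberg Spark procedures documentation. It is only a sketch: the helper name `build_sort_rewrite_call` and the table name `db.events` are hypothetical, and `id` is the sort column from the schema above.

```python
def build_sort_rewrite_call(catalog, table_name, sort_order):
    """Build a rewrite_data_files CALL that uses the sort strategy."""
    return (
        f"CALL {catalog}.system.rewrite_data_files(\n"
        f"    table => '{table_name}',\n"
        "    strategy => 'sort',\n"
        f"    sort_order => '{sort_order}'\n"
        ")"
    )


# Hypothetical table name; the sort column comes from our schema.
sort_sql = build_sort_rewrite_call("nessie", "db.events", "id ASC NULLS LAST")
print(sort_sql)
# With a live Spark session this would be passed on as: spark.sql(sort_sql)
```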
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

