dchristle commented on issue #6669:
URL: https://github.com/apache/iceberg/issues/6669#issuecomment-1688862454

   We are hitting this same bug: many of our tables are rewritten to have 
exactly the same number of partitions as they started with. No updates to the 
table happen between file rewrites. We have seen this for a while on earlier 
versions of Spark and Iceberg, and it still occurs with Iceberg 1.3.1 on 
Spark 3.4.1. 
   
   
![spark_iceberg_table_rewrite_bug](https://github.com/apache/iceberg/assets/5257839/6fb8f54e-6b95-4ad4-ad3a-c1f18d59a699)
   
   Querying the snapshots metadata for this table shows:
   ```
   spark.read.format("iceberg")
       .load("catalog.database.table_name.snapshots")
       .select("operation", 
               "summary.deleted-data-files", "summary.added-data-files", 
               "summary.removed-files-size", "summary.added-files-size")
       .show(500, truncate=false)
   
   +---------+------------------+----------------+------------------+----------------+
   |operation|deleted-data-files|added-data-files|removed-files-size|added-files-size|
   +---------+------------------+----------------+------------------+----------------+
   |replace  |461               |461             |65130575502       |65130575502     |
   |replace  |760               |760             |107250612676      |107250612676    |
   |replace  |759               |759             |107248686823      |107248225139    |
   |replace  |760               |760             |107366099394      |107366031533    |
   |replace  |759               |759             |107256949232      |107256949232    |
   |replace  |760               |760             |107320206910      |107320138775    |
   |replace  |753               |753             |107326458293      |107325552272    |
   |replace  |760               |760             |107304881346      |107304798912    |
   |replace  |null              |null            |null              |null            |
   |replace  |459               |459             |64992219554       |64991817897     |
   |replace  |761               |761             |107340441274      |107340253045    |
   |replace  |760               |760             |107360953592      |107360953592    |
   |replace  |753               |753             |107325552272      |107325383411    |
   |replace  |759               |759             |107256949232      |107256949232    |
   |replace  |761               |761             |107314333169      |107314550136    |
   |replace  |760               |760             |107251467603      |107250934878    |
   |replace  |null              |null            |null              |null            |
   |replace  |461               |461             |65238736381       |65238736381     |
   |replace  |760               |760             |107351232238      |107351343956    |
   |replace  |759               |759             |107243810268      |107244081909    |
   |replace  |759               |759             |107282034554      |107282034554    |
   |replace  |760               |760             |107261323675      |107261455992    |
   |replace  |762               |762             |107261642516      |107260731540    |
   |replace  |759               |759             |107265131201      |107265131201    |
   |replace  |null              |null            |null              |null            |
   |replace  |461               |461             |65238736381       |65238736381     |
   |replace  |759               |759             |107282034554      |107281765115    |
   |replace  |759               |759             |107244081909      |107244128779    |
   |replace  |760               |760             |107261455992      |107261455276    |
   |replace  |752               |752             |107297898703      |107297898703    |
   |replace  |760               |760             |107351343956      |107351334827    |
   |replace  |762               |762             |107260731540      |107260731540    |
   |replace  |759               |759             |107265131201      |107265124820    |
   |replace  |null              |null            |null              |null            |
   +---------+------------------+----------------+------------------+----------------+
   ```
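
As a rough cross-check on the output above, dividing `added-files-size` by `added-data-files` gives the average size of the files each rewrite produced. The sketch below does that arithmetic for two rows copied from the table; the class and method names are purely illustrative and not part of any Iceberg API.

```java
public class AverageRewrittenFileSize {

    // Average bytes per file for one snapshot summary row (integer division).
    static long averageBytes(long totalBytes, long fileCount) {
        return totalBytes / fileCount;
    }

    public static void main(String[] args) {
        // {added-data-files, added-files-size} taken from the first two rows above.
        long[][] rows = {
            {461L, 65_130_575_502L},
            {760L, 107_250_612_676L},
        };
        for (long[] row : rows) {
            long avg = averageBytes(row[1], row[0]);
            // Report in MiB so it can be compared with the 256 MiB target.
            System.out.printf("%d files, average %.1f MiB%n",
                    row[0], avg / (1024.0 * 1024.0));
        }
    }
}
```

Both rows come out at roughly 135 MiB per file, about half of the 268435456-byte (256 MiB) target.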
   
   This table has been unmodified for some time, but it keeps getting 
rewritten. Using `SHOW CREATE TABLE` shows the following table properties:
   ```
   TBLPROPERTIES (
     'current-snapshot-id' = '7858704354037519378',
     'format' = 'iceberg/parquet',
     'format-version' = '1',
     'read.parquet.vectorization.enabled' = 'true',
     'write.parquet.compression-codec' = 'zstd',
     'write.parquet.compression-level' = '4',
     'write.target-file-size-bytes' = '268435456')
   ```
   
   In the bucket, the most recently written Parquet files are all around 
130MB, rather than the 256MB target. A sample:
   
![recent_rewritten_file_sizes](https://github.com/apache/iceberg/assets/5257839/57e3cfbd-6800-45da-9fac-fecd39f43798)
   
   At least in this case, the output files don't seem to be written close 
enough to the target size to avoid being reselected for a rewrite in the next 
run. IIRC, the default minimum-size ratio is 0.75 of the target, and these 
files are only about 0.5 of it. We aren't specifying a target file size in the 
rewrite options, either, so the table's 256MB target should apply, AFAIK. For 
reference, our exact command is:
   ```
   SparkActions
       .get()
       .rewriteDataFiles(icebergTable)
       .option("delete-file-threshold", "1")
       .option("min-input-files", "5")
       .option("partial-progress.enabled", "true")
       .option("max-concurrent-file-group-rewrites", "30")
       .execute()
   ```
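
To make the size reasoning above concrete, here is a minimal sketch of the check as I understand it. It assumes the default bounds are 0.75x and 1.8x of the target file size, and it ignores the other selection criteria (`min-input-files`, `delete-file-threshold`, and so on). The class, constants, and method are hypothetical stand-ins, not Iceberg's actual implementation.

```java
public class RewriteCandidateSketch {

    // Assumed default ratios relative to the target file size.
    static final double MIN_FILE_SIZE_RATIO = 0.75;
    static final double MAX_FILE_SIZE_RATIO = 1.80;

    // True if a file falls outside the assumed [min, max] size window,
    // which would make it a candidate for another rewrite.
    static boolean outsideDesiredRange(long fileSizeBytes, long targetBytes) {
        long minBytes = (long) (targetBytes * MIN_FILE_SIZE_RATIO);
        long maxBytes = (long) (targetBytes * MAX_FILE_SIZE_RATIO);
        return fileSizeBytes < minBytes || fileSizeBytes > maxBytes;
    }

    public static void main(String[] args) {
        long target = 268_435_456L;            // write.target-file-size-bytes
        long observed = 130L * 1024 * 1024;    // ~130MB files seen in the bucket
        long onTarget = 256L * 1024 * 1024;    // a file written at the target size

        System.out.println("observed file reselected: "
                + outsideDesiredRange(observed, target));
        System.out.println("on-target file reselected: "
                + outsideDesiredRange(onTarget, target));
    }
}
```

Under these assumptions the lower bound is 201326592 bytes (192 MiB), so a ~130MB output file would be picked up again on every run, while a file written at the target size would not.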
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

