dchristle commented on issue #6669: URL: https://github.com/apache/iceberg/issues/6669#issuecomment-1688862454
We encounter this same bug: many of our tables are rewritten to have exactly the same number of partitions as they started with. No updates to the table happen between file rewrites. This has happened for a while on previous versions of Spark and Iceberg, and is still present when using Iceberg 1.3.1 on Spark 3.4.1.

Querying the snapshots metadata for this table shows:

```
spark.read.format("iceberg")
  .load("catalog.database.table_name.snapshots")
  .select("operation", "summary.deleted-data-files", "summary.added-data-files", "summary.removed-files-size", "summary.added-files-size")
  .show(500, truncate=false)

+---------+------------------+----------------+------------------+----------------+
|operation|deleted-data-files|added-data-files|removed-files-size|added-files-size|
+---------+------------------+----------------+------------------+----------------+
|replace  |461               |461             |65130575502       |65130575502     |
|replace  |760               |760             |107250612676      |107250612676    |
|replace  |759               |759             |107248686823      |107248225139    |
|replace  |760               |760             |107366099394      |107366031533    |
|replace  |759               |759             |107256949232      |107256949232    |
|replace  |760               |760             |107320206910      |107320138775    |
|replace  |753               |753             |107326458293      |107325552272    |
|replace  |760               |760             |107304881346      |107304798912    |
|replace  |null              |null            |null              |null            |
|replace  |459               |459             |64992219554       |64991817897     |
|replace  |761               |761             |107340441274      |107340253045    |
|replace  |760               |760             |107360953592      |107360953592    |
|replace  |753               |753             |107325552272      |107325383411    |
|replace  |759               |759             |107256949232      |107256949232    |
|replace  |761               |761             |107314333169      |107314550136    |
|replace  |760               |760             |107251467603      |107250934878    |
|replace  |null              |null            |null              |null            |
|replace  |461               |461             |65238736381       |65238736381     |
|replace  |760               |760             |107351232238      |107351343956    |
|replace  |759               |759             |107243810268      |107244081909    |
|replace  |759               |759             |107282034554      |107282034554    |
|replace  |760               |760             |107261323675      |107261455992    |
|replace  |762               |762             |107261642516      |107260731540    |
|replace  |759               |759             |107265131201      |107265131201    |
|replace  |null              |null            |null              |null            |
|replace  |461               |461             |65238736381       |65238736381     |
|replace  |759               |759             |107282034554      |107281765115    |
|replace  |759               |759             |107244081909      |107244128779    |
|replace  |760               |760             |107261455992      |107261455276    |
|replace  |752               |752             |107297898703      |107297898703    |
|replace  |760               |760             |107351343956      |107351334827    |
|replace  |762               |762             |107260731540      |107260731540    |
|replace  |759               |759             |107265131201      |107265124820    |
|replace  |null              |null            |null              |null            |
+---------+------------------+----------------+------------------+----------------+
```

This table has been unmodified for some time, but it keeps getting rewritten. Using `SHOW CREATE TABLE` shows the following table properties:

```
TBLPROPERTIES (
  'current-snapshot-id' = '7858704354037519378',
  'format' = 'iceberg/parquet',
  'format-version' = '1',
  'read.parquet.vectorization.enabled' = 'true',
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '4',
  'write.target-file-size-bytes' = '268435456')
```

In the bucket, the most recently written Parquet files are all around 130MB, rather than the 256MB target. A sample:

At least in this case, it seems like the output files aren't being written close enough to the target size to avoid being reselected for a rewrite in the next run. IIRC, the default ratio is 0.75, and these files are about 0.5 of the target. We aren't specifying the target file size in the rewrite options, either, so the 256MB target from the table properties should be in effect, AFAIK.

For reference, our exact command is:

```
SparkActions
  .get()
  .rewriteDataFiles(icebergTable)
  .option("delete-file-threshold", "1")
  .option("min-input-files", "5")
  .option("partial-progress.enabled", "true")
  .option("max-concurrent-file-group-rewrites", "30")
  .execute()
```
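To make the size reasoning concrete, here is a rough sketch of the arithmetic, not Iceberg's actual selection code. `RewriteThresholdSketch` and its methods are hypothetical names for illustration. The 0.75 lower ratio is the default recalled above; the 1.8 upper ratio is my assumption about the matching default. The sketch only covers the size check and ignores `min-input-files` and `delete-file-threshold`.

```scala
// Hypothetical helper, NOT part of the Iceberg API. It only mirrors the
// size-based rule discussed above: a file is a rewrite candidate when it is
// smaller than minRatio * target or larger than maxRatio * target.
object RewriteThresholdSketch {
  // 'write.target-file-size-bytes' from the table properties above (256 MiB).
  val TargetFileSizeBytes: Long = 268435456L
  // Assumed default ratios; verify against the Iceberg version in use.
  val MinFileSizeRatio: Double = 0.75
  val MaxFileSizeRatio: Double = 1.80

  def minFileSizeBytes(target: Long): Long = (target * MinFileSizeRatio).toLong
  def maxFileSizeBytes(target: Long): Long = (target * MaxFileSizeRatio).toLong

  def isRewriteCandidate(fileSizeBytes: Long, target: Long): Boolean =
    fileSizeBytes < minFileSizeBytes(target) || fileSizeBytes > maxFileSizeBytes(target)

  def main(args: Array[String]): Unit = {
    // Average output file size from one snapshot row above:
    // added-files-size = 107250612676 bytes over added-data-files = 760.
    val avgFileSize = 107250612676L / 760
    val ratio = avgFileSize.toDouble / TargetFileSizeBytes

    println(s"lower threshold:   ${minFileSizeBytes(TargetFileSizeBytes)} bytes")
    println(s"average file size: $avgFileSize bytes")
    println(f"fraction of target: $ratio%.2f")
    println(s"rewrite candidate: ${isRewriteCandidate(avgFileSize, TargetFileSizeBytes)}")
  }
}
```

With a 256 MiB target the lower threshold works out to 201326592 bytes (192 MiB), so files averaging roughly 135 MiB would fall below it on every run, which would match the repeated `replace` snapshots if the size check is what selects them.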
