Baunsgaard opened a new pull request, #16821: URL: https://github.com/apache/iceberg/pull/16821
## What

`TestSparkDataFile` validates that `DataFile` / `DeleteFile` metadata round-trips through Spark's `#data_files` / `#delete_files` metadata tables. Each test writes random rows through a 14-field `PartitionSpec` (`identity` / `bucket` / `hour` / `truncate` over the columns), so nearly every row lands in its own partition and produces a separate single-row data file, plus one position-delete file per data file. The cost therefore scales with the generated row count rather than with coverage.

This reduces the generated row count from `200` to `40` in the Spark 3.5 / 4.0 / 4.1 test copies.

## Why it is safe

- Coverage is per-row, not per-count: every generated row carries all 15 columns and produces a partition tuple across all 14 transforms, so 40 rows still exercise every column type and every partition transform.
- Both delete-file types are still validated (one position delete per data file plus two equality deletes).
- The deterministic seed (`0`) is unchanged, so the generated data is reproducible.
- No assertion is weakened; only the number of sample files changes.

## Impact

Measured locally (JDK 17, Spark 4.1 core, `--no-build-cache`), per method:

| Method | before | after |
|---|---:|---:|
| testValueConversionPartitionedTable | 10.8s | 2.7s |
| testValueConversionWithEmptyStats | 11.3s | 3.7s |
| testValueConversion (unpartitioned) | 5.4s | 5.2s |
| **Suite total** | **27.4s** | **11.5s (−58%)** |

The unpartitioned case is effectively unchanged (it writes a single file regardless of row count). This is a small, isolated suite (~1% of core test self-time), so the end-to-end CI impact is minor.

## Testing

`./gradlew :iceberg-spark:iceberg-spark-4.1:test --tests org.apache.iceberg.spark.source.TestSparkDataFile`

3 tests, 0 failures. Also verified green on v3.5 (11.7s) and v4.0 (11.8s), with identical pass/skip counts.

Test-only change, applied identically across v3.5, v4.0, and v4.1.
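
For reviewers who want intuition for why cost tracks row count rather than coverage, here is a minimal sketch. It is not taken from the test itself: `PartitionFanout`, `partitionTuple`, and `distinctPartitions` are hypothetical names, the three columns are invented for illustration, and the transforms are simplified stand-ins rather than Iceberg's actual `identity` / `bucket` / `truncate` / `hour` implementations. It counts how many distinct partition tuples a batch of seeded random rows produces; with an identity transform on a random column, that count is expected to sit at or near the row count.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class PartitionFanout {

    // Builds a partition tuple from one row using simplified stand-in transforms.
    // Assumption: these only mimic the shape of the real transforms.
    static List<Object> partitionTuple(int id, String name, long tsMicros) {
        int bucket = Math.floorMod(id, 16);                           // bucket-like
        String prefix = name.substring(0, Math.min(2, name.length())); // truncate-like
        long hour = Math.floorDiv(tsMicros, 3_600_000_000L);          // hour-like
        return List.of(id, bucket, prefix, hour);                     // id is identity-like
    }

    // Generates rowCount seeded random rows and counts distinct partition tuples.
    // Each distinct tuple would correspond to at least one separate data file.
    static int distinctPartitions(int rowCount, long seed) {
        Random random = new Random(seed);
        Set<List<Object>> partitions = new HashSet<>();
        for (int i = 0; i < rowCount; i++) {
            int id = random.nextInt();
            String name = Long.toHexString(random.nextLong());
            long tsMicros = random.nextLong();
            partitions.add(partitionTuple(id, name, tsMicros));
        }
        return partitions.size();
    }

    public static void main(String[] args) {
        for (int rows : new int[] {200, 40}) {
            System.out.println(rows + " rows -> "
                + distinctPartitions(rows, 0L) + " distinct partition tuples");
        }
    }
}
```

Because both runs use the same seed, the 40-row batch is a prefix of the 200-row batch, and every row in either batch passes through all of the transforms. That mirrors the argument above: fewer rows means fewer files, not fewer transforms exercised.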
