Baunsgaard opened a new pull request, #16821:
URL: https://github.com/apache/iceberg/pull/16821

   ## What
   
   `TestSparkDataFile` validates that `DataFile` / `DeleteFile` metadata 
round-trips through Spark's `#data_files` / `#delete_files` metadata tables. 
Each test writes random rows through a 14-field `PartitionSpec` (`identity` / 
`bucket` / `hour` / `truncate` over the columns), so nearly every row lands in 
its own partition and produces a separate single-row data file, plus one 
position-delete file per data file. Test runtime therefore scales with the 
generated row count, while coverage does not.
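
To illustrate the fan-out described above, here is a minimal, self-contained sketch that counts how many distinct partition tuples a batch of random rows produces. It does not use Iceberg: the class `PartitionFanout`, the three-column row, and the `bucket` / `truncate` stand-ins are hypothetical simplifications (Iceberg's real `bucket` transform uses a different hash, and the real spec has 14 fields).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class PartitionFanout {

  // Counts the distinct partition tuples produced by `rows` random rows.
  // The transforms below are simplified stand-ins, not Iceberg's implementations.
  public static int distinctPartitions(int rows, long seed) {
    Random random = new Random(seed);
    Set<List<Object>> partitions = new HashSet<>();
    for (int i = 0; i < rows; i++) {
      // Hypothetical three-column row: an int, a long, and a string.
      int id = random.nextInt();
      long ts = random.nextLong();
      String name = Long.toHexString(random.nextLong());

      Object identity = id;                                  // identity(id)
      Object bucket = Math.floorMod(Long.hashCode(ts), 16);  // stand-in for bucket(16, ts)
      Object truncate =
          name.substring(0, Math.min(2, name.length()));     // stand-in for truncate(2, name)

      // Each distinct tuple would become its own partition, and so its own data file.
      partitions.add(Arrays.asList(identity, bucket, truncate));
    }
    return partitions.size();
  }

  public static void main(String[] args) {
    System.out.println("200 rows -> " + distinctPartitions(200, 0L) + " partitions");
    System.out.println("40 rows  -> " + distinctPartitions(40, 0L) + " partitions");
  }
}
```

Because one tuple component is an identity transform over a random value, the number of distinct tuples in this sketch stays close to the row count, which is why the file count tracks the number of generated rows.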
   
   This PR reduces the generated row count from `200` to `40` in the Spark 3.5 / 
4.0 / 4.1 test copies.
   
   ## Why it is safe
   
   - Coverage is per-row, not per-count: every generated row carries all 15 
columns and produces a partition tuple across all 14 transforms, so 40 rows 
still exercise every column type and every partition transform.
   - Both delete-file types are still validated (one position delete per data 
file plus two equality deletes).
   - The deterministic seed (`0`) is unchanged, so the generated data is 
reproducible.
   - No assertion is weakened; only the number of sample files changes.
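
The reproducibility point can be sketched with plain `java.util.Random`: two generators built from the same seed emit the same sequence, so a seeded run is repeatable at any row count. The class `SeededRows` and its `generate` method are hypothetical and only stand in for the test's actual data generator, whose internals are not shown here.

```java
import java.util.Random;

public class SeededRows {

  // Generates `rows` pseudo-random values from a fixed seed.
  // Hypothetical stand-in for the test's row generator.
  public static long[] generate(int rows, long seed) {
    Random random = new Random(seed);
    long[] values = new long[rows];
    for (int i = 0; i < rows; i++) {
      values[i] = random.nextLong();
    }
    return values;
  }

  public static void main(String[] args) {
    long[] first = generate(40, 0L);
    long[] second = generate(40, 0L);
    // Same seed and same row count give element-wise identical data.
    System.out.println(java.util.Arrays.equals(first, second));
  }
}
```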
   
   
   ## Impact
   
   Measured locally (JDK 17, Spark 4.1 core, `--no-build-cache`), per method:
   
   | Method | before | after |
   |---|---:|---:|
   | testValueConversionPartitionedTable | 10.8s | 2.7s |
   | testValueConversionWithEmptyStats | 11.3s | 3.7s |
   | testValueConversion (unpartitioned) | 5.4s | 5.2s |
   | **Suite total** | **27.4s** | **11.5s (−58%)** |
   
   
   The unpartitioned case is essentially unchanged (5.4s to 5.2s) because it 
writes a single file regardless of row count. This is a small, isolated suite 
(~1% of core test self-time), so the end-to-end CI impact is minor.
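
For reference, the percentages implied by the table can be recomputed from the before/after timings. This is a small arithmetic sketch; the class `SuiteSpeedup` and its method are hypothetical helpers, and the inputs are the timings reported above.

```java
public class SuiteSpeedup {

  // Percentage reduction from `before` to `after`, rounded to the nearest whole percent.
  public static long percentReduction(double before, double after) {
    return Math.round((before - after) / before * 100.0);
  }

  public static void main(String[] args) {
    // Timings in seconds, taken from the table above.
    System.out.println("partitioned:   " + percentReduction(10.8, 2.7) + "%");
    System.out.println("empty stats:   " + percentReduction(11.3, 3.7) + "%");
    System.out.println("unpartitioned: " + percentReduction(5.4, 5.2) + "%");
    System.out.println("suite total:   " + percentReduction(27.4, 11.5) + "%");
  }
}
```

The two partitioned methods account for nearly all of the saving, which matches the explanation that only the partitioned tests write one file per row.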
   
   ## Testing
   
   `./gradlew :iceberg-spark:iceberg-spark-4.1:test --tests org.apache.iceberg.spark.source.TestSparkDataFile`
   
   3 tests, 0 failures. Also verified green on v3.5 (11.7s) and v4.0 (11.8s), 
with identical pass/skip counts. This is a test-only change, applied 
identically across v3.5, v4.0, and v4.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

