Baunsgaard opened a new pull request, #16740:
URL: https://github.com/apache/iceberg/pull/16740
## What
`TestRewriteDataFilesAction` is the slowest single class in the Spark core test module. Each test materializes a large (`SCALE = 400000`-row) input table via a Spark write before exercising the rewrite under test, and many tests reuse the same input shape across the `formatVersion` matrix.

This caches the written input data files keyed by table shape (`formatVersion`, spec, `files`, `rows`, `partitions`, properties) and reuses them by re-appending the cached `DataFile`s to a fresh table. The expensive Spark write of the input now runs once per JVM fork instead of once per test; the rewrite under test still runs per test on its own fresh table.
Applied identically to Spark 3.5, 4.0 and 4.1.
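
The caching pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `TableShape`, `InputFileCache`, and the `writer` function are hypothetical names, and plain strings stand in for Iceberg `DataFile` objects so the example runs without Spark.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache key: every input that influences the written files.
// Records derive equals/hashCode from all components, including the map.
record TableShape(
    int formatVersion,
    String spec,
    int files,
    int rows,
    int partitions,
    Map<String, String> properties) {}

final class InputFileCache {
  // Static so entries survive across test methods within one JVM fork.
  // Strings are placeholders for the cached data file handles.
  private static final Map<TableShape, List<String>> CACHE = new ConcurrentHashMap<>();

  private InputFileCache() {}

  // Runs the expensive writer only on the first request for a given shape;
  // later requests with an equal shape get the same cached list back.
  static List<String> filesFor(
      TableShape shape, Function<TableShape, List<String>> writer) {
    return CACHE.computeIfAbsent(shape, writer);
  }
}
```

In a real test, the `writer` would perform the Spark write and collect the resulting files, and the caller would then append the returned files to its own fresh table before running the rewrite.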
## Why it is safe
- The generated data is deterministic (fixed `Random(42)` seed), so reuse is byte-identical to regenerating it.
- Cached files live in a static `@TempDir`, so they survive across tests (not wiped by the per-test temp dir) and are cleaned up after the class.
- `includeColumnStats()` is used when collecting the cached files so lower/upper bounds and value counts are preserved on re-append.
- The rewrite under test is unchanged and still runs per test, so no assertion is weakened.
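
The first bullet relies on `java.util.Random` producing the same sequence whenever it is created with the same seed. A small sketch of that property, using a hypothetical `DeterministicData` helper that is not part of the test class:

```java
import java.util.Random;

final class DeterministicData {
  private DeterministicData() {}

  // Generates `count` pseudo-random ints from the given seed.
  // Two calls with the same seed and count return equal arrays.
  static int[] generate(long seed, int count) {
    Random random = new Random(seed);
    int[] values = new int[count];
    for (int i = 0; i < count; i++) {
      values[i] = random.nextInt();
    }
    return values;
  }
}
```

Calling `DeterministicData.generate(42, n)` twice yields equal arrays, which is why regenerating the input and reusing a cached copy are expected to give the rewrite the same data.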
## Results (local, JDK 17, 32 cores)
| Scope | baseline | with cache |
|---|---:|---:|
| `TestRewriteDataFilesAction` (single-thread) | 705s | 455s |
| full `iceberg-spark-3.5_2.13` core module (`testParallelism=auto`) | 18m56s | 14m57s |
Test/skip counts are unchanged at both class and module level:
Spark 3.5 = 168 tests / 6 skipped / 0 failed; Spark 4.0 & 4.1 = 171 / 6 / 0.
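
For reference, the relative savings implied by the timings in the table above can be worked out as below. The `Savings` helper is a hypothetical name used only for this arithmetic; the inputs are the table values converted to seconds (18m56s = 1136s, 14m57s = 897s).

```java
final class Savings {
  private Savings() {}

  // Percentage of the baseline wall-clock time removed by the cache.
  static double percentReduction(long baselineSeconds, long cachedSeconds) {
    return 100.0 * (baselineSeconds - cachedSeconds) / baselineSeconds;
  }
}
```

With these inputs, `Savings.percentReduction(705, 455)` is roughly 35% for the single class and `Savings.percentReduction(1136, 897)` is roughly 21% for the full module.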
## Notes
- Scoped to the `createTable(int)` / `createTablePartitioned(...)` helpers; the few in-test `writeRecords(..., SCALE, ...)` call sites are not yet cached.