Re: [PR] Spark: Spark tests cache rewrite input [iceberg]

via GitHub Wed, 10 Jun 2026 09:40:50 -0700


Baunsgaard commented on code in PR #16740:
URL: https://github.com/apache/iceberg/pull/16740#discussion_r3389952253



##########
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -143,6 +150,15 @@ public class TestRewriteDataFilesAction extends TestBase {
   @TempDir private File tableDir;
   private static final int SCALE = 400000;
 
+  // Cache of pre-written input data files keyed by table shape (schema/spec/props are
+  // fixed per key), so identical large inputs are materialized via Spark only once per JVM
+  // fork and reused by every test that asks for the same shape. The Spark write of SCALE
+  // rows dominates these tests; the rewrite under test still runs per test on a fresh table.
+  @TempDir private static Path inputCacheDir;
+  private static final Map<String, List<DataFile>> INPUT_FILE_CACHE = Maps.newConcurrentMap();

Review Comment:
   Okay, I added an `@AfterAll` that clears the cache and the lock map and 
resets the seq. `@AfterAll` runs once after all tests in the class, so 
cross-test caching within a run is unchanged; it only stops a second run in the 
same JVM (for example, an IDE re-run) from returning DataFiles that point into 
the recreated `@TempDir`.
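
A minimal sketch of the pattern described above, not the PR's actual code. It uses plain Java so it runs without JUnit or Iceberg on the classpath. The class `InputCacheSketch`, the fields `LOCKS` and `SEQ`, and the methods `cachedInput` and `clearInputCache` are hypothetical names, and file names stand in for `DataFile` instances. In the real test class the reset method would presumably carry JUnit 5's `@AfterAll`.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class InputCacheSketch {
  // Hypothetical stand-ins for the cache, lock map, and seq mentioned in the comment.
  static final Map<String, List<String>> INPUT_FILE_CACHE = new ConcurrentHashMap<>();
  static final Map<String, Object> LOCKS = new ConcurrentHashMap<>();
  static final AtomicInteger SEQ = new AtomicInteger();

  // Returns the cached input for a table shape, writing it only on the first request.
  static List<String> cachedInput(String shapeKey) {
    Object lock = LOCKS.computeIfAbsent(shapeKey, k -> new Object());
    synchronized (lock) {
      List<String> files = INPUT_FILE_CACHE.get(shapeKey);
      if (files == null) {
        // Stand-in for the expensive Spark write into the shared temp directory.
        files = List.of("input-" + SEQ.getAndIncrement() + ".parquet");
        INPUT_FILE_CACHE.put(shapeKey, files);
      }
      return files;
    }
  }

  // Would be annotated with @AfterAll in a JUnit 5 test class (assumption):
  // drops all static state so a later run in the same JVM cannot reuse
  // entries that point into a temp directory that has since been recreated.
  static void clearInputCache() {
    INPUT_FILE_CACHE.clear();
    LOCKS.clear();
    SEQ.set(0);
  }

  public static void main(String[] args) {
    List<String> first = cachedInput("shapeA");
    System.out.println(first == cachedInput("shapeA")); // same instance within a run
    clearInputCache();
    System.out.println(first == cachedInput("shapeA")); // rebuilt after the reset
  }
}
```

Because the reset only runs after the last test, every test in a single run still shares the same cached entry per shape; only state that would otherwise leak into a later run is discarded.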



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


