RussellSpitzer opened a new pull request, #16839:
URL: https://github.com/apache/iceberg/pull/16839
## Why
Spark integration tests run against parameterized catalogs that all write to
local disk, which dominates wall-clock time. Switching the test catalogs to
`InMemoryFileIO` (and replacing the `HADOOP` parameterization with
`InMemoryCatalog`) keeps the same coverage while running much faster within a
single JVM.
A single-class wall-clock measurement
(`TestRequiredDistributionAndOrdering`) showed roughly **1.93x speedup for
`testhive`** and **1.50x for `spark_catalog`**.
## What
### Core (one production change)
- Add `InMemoryCatalog.INSTANCE_NAMESPACE`
(`in-memory-catalog.instance-namespace`). Instances initialized with the same
value share namespaces, tables, and views through a JVM-wide store. Needed
because Spark clones sessions and tests build separate validation catalogs that
must observe each other's writes.
- `close()` no longer clears state when the instance shares state through
`INSTANCE_NAMESPACE`.
- Add `clearInstanceNamespace(String)` for explicit test cleanup.
- New unit tests cover sharing, isolation across distinct namespaces, and
that `close()` does not wipe shared state.
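
The sharing semantics described above can be modeled without any Iceberg dependency. The following is a minimal sketch only: `SketchCatalog` and its methods are hypothetical stand-ins, not the actual `InMemoryCatalog` code, and the store is reduced to a set of table names. It assumes that a null namespace means "not sharing".

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the sharing behavior; NOT the real InMemoryCatalog. */
class SketchCatalog implements AutoCloseable {
  // JVM-wide registry: instance namespace -> table names shared by all instances using it.
  private static final Map<String, Set<String>> SHARED = new ConcurrentHashMap<>();

  private final Set<String> tables;
  private final boolean sharing;

  /** A null instanceNamespace means the instance keeps private state (assumption). */
  SketchCatalog(String instanceNamespace) {
    this.sharing = instanceNamespace != null;
    if (sharing) {
      this.tables =
          SHARED.computeIfAbsent(instanceNamespace, key -> ConcurrentHashMap.newKeySet());
    } else {
      this.tables = ConcurrentHashMap.newKeySet();
    }
  }

  void createTable(String name) {
    tables.add(name);
  }

  boolean tableExists(String name) {
    return tables.contains(name);
  }

  /** Only a non-sharing instance wipes its state on close. */
  @Override
  public void close() {
    if (!sharing) {
      tables.clear();
    }
  }

  /** Explicit cleanup hook for tests: drops and empties the shared store. */
  static void clearInstanceNamespace(String instanceNamespace) {
    Set<String> removed = SHARED.remove(instanceNamespace);
    if (removed != null) {
      removed.clear();
    }
  }

  public static void main(String[] args) {
    SketchCatalog writer = new SketchCatalog("demo-ns");
    SketchCatalog validator = new SketchCatalog("demo-ns");
    writer.createTable("db.t");
    System.out.println(validator.tableExists("db.t")); // true: same namespace, same store
    writer.close();
    System.out.println(validator.tableExists("db.t")); // true: close() keeps shared state
    SketchCatalog.clearInstanceNamespace("demo-ns");
    System.out.println(validator.tableExists("db.t")); // false: explicit cleanup
  }
}
```

In this model the validation catalog sees the writer's table because both instances resolve the same namespace key, which is the property the Spark tests rely on when sessions are cloned.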
### Spark tests
- `SparkCatalogConfig`: switch `HIVE`, `REST`, `SPARK_SESSION`,
`SPARK_SESSION_WITH_VIEWS`, `SPARK_WITH_HIVE_VIEWS`,
`SPARK_SESSION_WITH_UNIQUE_LOCATION`, `HIVE_WITH_UNIQUE_LOCATION` to use
`InMemoryFileIO`. Add an `INMEMORY` config backed by `InMemoryCatalog` (with
`INSTANCE_NAMESPACE`).
- `TestBaseWithCatalog`:
  - Clear stale `spark.sql.catalog.<name>.*` properties from the shared
    `SparkSession` before each test (the static session leaks config across
    parameterizations).
  - When a test's catalog config overrides `io-impl`, build a fresh
    validation `HiveCatalog` with the same `FileIO` so reads round-trip.
  - Initialize the validation `InMemoryCatalog` with the test catalog's
    configuration so they share state.
- Pass `io-impl=InMemoryFileIO` to the `RESTServerExtension` backend.
- `CatalogTestBase`: replace the `HADOOP` parameter with `INMEMORY`.
- `TestDataFrameWriterV2`: parameterize the expected error message by
catalog name (no longer hardcoded to `testhadoop`).
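
As an illustration of the stale-property cleanup in `TestBaseWithCatalog`, here is a rough sketch that operates on a plain `Map` standing in for the session configuration. `StaleCatalogConf` and `clearCatalogProperties` are hypothetical names, and the real test base would act on the shared `SparkSession` rather than a map.

```java
import java.util.Map;

/** Hypothetical helper; a plain Map stands in for the shared session's configuration. */
class StaleCatalogConf {
  /**
   * Removes every key under "spark.sql.catalog.<catalogName>." and returns how many
   * keys were removed. The trailing dot keeps a catalog named "testhive" from
   * matching keys that belong to a catalog named "testhive2".
   */
  static int clearCatalogProperties(Map<String, String> conf, String catalogName) {
    String prefix = "spark.sql.catalog." + catalogName + ".";
    int before = conf.size();
    conf.keySet().removeIf(key -> key.startsWith(prefix));
    return before - conf.size();
  }
}
```

The point of the sketch is the prefix match: without a per-test reset, an `io-impl` override from one parameterization would still be visible to the next one that reuses the same catalog name.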
### Tests opted out of in-memory I/O
- `TestCompressionSettings` and `TestMetadataTableReadableMetrics` filter
`io-impl` out of their parameter properties because they assert against raw
on-disk Parquet/ORC/Avro files via Hadoop readers (`ParquetFileReader`,
`OrcFile`, `Files.localOutput`). These continue to use `HadoopFileIO`.
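
The opt-out amounts to dropping one key from the parameter properties before the catalog is configured. A minimal sketch, assuming the properties are a `Map<String, String>`; `IoImplOptOut` and `withoutIoImpl` are hypothetical names, not helpers from this PR:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper showing how a test could opt out of in-memory I/O. */
class IoImplOptOut {
  /** Returns a copy of the properties without the "io-impl" entry; the input is not modified. */
  static Map<String, String> withoutIoImpl(Map<String, String> properties) {
    return properties.entrySet().stream()
        .filter(entry -> !"io-impl".equals(entry.getKey()))
        .collect(
            Collectors.toMap(
                Map.Entry::getKey,
                Map.Entry::getValue,
                (first, second) -> first, // keys are already unique, so this is never hit
                LinkedHashMap::new)); // preserve the original property order
  }
}
```

With `io-impl` absent, the catalog falls back to its default `FileIO`, which is what lets these tests keep reading real files from disk.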
## Risks / open questions
- This is a **test-only** behavior change in Spark, plus a **public
addition** to `InMemoryCatalog` (`INSTANCE_NAMESPACE` constant +
`clearInstanceNamespace`). The new property is opt-in.
- The change touches two modules. Reviewers may prefer splitting into:
  1. `Core: InMemoryCatalog INSTANCE_NAMESPACE for shared in-process state`
  2. `Spark: Use in-memory FileIO/Catalog in Spark 4.1 tests for faster runs`

  Happy to split if requested.
- Several other tests in the suite hit pre-existing flakes
(`TestSparkReadMetrics.testDeleteMetrics` rounds planning duration to 0 under
in-memory I/O; `TestTimestampWithoutZone` has Hive Metastore namespace ordering
issues). These pass in isolation; not addressed here.
## Validation
- `./gradlew :iceberg-core:spotlessCheck
:iceberg-spark:iceberg-spark-4.1_2.13:spotlessCheck` clean
- `./gradlew :iceberg-core:test --tests
org.apache.iceberg.inmemory.TestInMemoryCatalog` passes (the three new tests
included)
- `./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:compileTestJava` clean
- `spark.source` package-level run (60 test classes): the only failures are
pre-existing flakes
## Draft
Draft PR — running `parallel-reviews` against this for the multi-perspective
review pass.