andygrove opened a new pull request, #4129:
URL: https://github.com/apache/datafusion-comet/pull/4129
## Which issue does this PR close?
Closes #.
## Rationale for this change
Each Spark-version-specific TPC-DS plan-stability directory currently stores
a full copy of every query's golden plan, even when that query's plan is
byte-identical to the previous version's. After the recent Spark 4.1 enablement
(#4106), the per-version `approved-plans-*-spark4_1` directories duplicated 196
of 206 (v1_4) and 60 of 64 (v2_7) plans against the 4.0 directory, which itself
was largely a duplicate of 3.5, which was largely a duplicate of the base. This
pattern grows quadratically as new versions are added.
This PR teaches `CometPlanStabilitySuite` to fall back through the chain of
older version directories at read time, so each version-specific directory only
needs to contain the queries whose plans actually diverge.
## What changes are included in this PR?
- `CometPlanStabilitySuite` gains a `protected def fallbackGoldenFilePaths:
Seq[String]` hook. `getDirForTest` is unchanged in regenerate mode (always
writes to the primary directory). For read-time lookups, when the primary
`goldenFilePath` does not contain a directory for the active query, the suite
walks the fallback list and returns the first existing directory.
- Both `CometTPCDSV1_4_PlanStabilitySuite` and
`CometTPCDSV2_7_PlanStabilitySuite` build their fallback chain via
`CometPlanStabilitySuite.planNameChain(variant)`, which assembles the right
sequence based on `isSpark{35,40,41}Plus`. The chain ends at the unsuffixed
base directory.
- `dev/regenerate-golden-files.sh` now prunes any query directory whose
contents match what the fallback chain would resolve to, immediately after
regenerating each version. This keeps the directories sparse without manual
cleanup. The allowed `--spark-version` values are extended to include `4.1`
(now supported on `main`).
- Applies the prune once across the existing tree, removing 766 duplicate
query directories:
| directory | before | after |
| --- | --- | --- |
| `approved-plans-v1_4-spark3_5` | 206 | 10 |
| `approved-plans-v1_4-spark4_0` | 206 | 12 |
| `approved-plans-v1_4-spark4_1` | 206 | 10 |
| `approved-plans-v2_7-spark3_5` | 64 | 4 |
| `approved-plans-v2_7-spark4_0` | 64 | 4 |
| `approved-plans-v2_7-spark4_1` | 64 | 4 |
## How are these changes tested?
Locally, `CometTPCDSV1_4_PlanStabilitySuite` (194 tests) and
`CometTPCDSV2_7_PlanStabilitySuite` (64 tests) pass with 0 failures against
`-Pspark-3.5`, `-Pspark-4.0`, and `-Pspark-4.1` after the prune. Spark 3.4
reads only the base directory and is unaffected. CI will exercise all four
matrix profiles.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]