andygrove opened a new pull request, #3977:
URL: https://github.com/apache/datafusion-comet/pull/3977
## Which issue does this PR close?
Related to #3949.
## Rationale for this change
Issue #3949 reports an intermittent `AssertionError` from
`ColumnarToRowExec.<init>` in the awslabs TPC-DS benchmark (q14a, q14b, q31,
q47, q57 on Spark 4.0). Earlier attempts to reproduce it with small DPP queries
failed, and 1TB local runs did not trigger it either. This PR adds infrastructure to
stress the Spark/Comet boundary and probe the canonicalization invariants the
stack trace points at, so future regressions around mixed-mode plan shapes have
automated coverage.
The full stack trace shared on the issue shows the assertion fires from
`QueryPlan.doCanonicalize` → `withNewChildren(canonicalizedChildren)` →
`ColumnarToRowExec.copy(...)` inside `AdaptiveSparkPlanExec.createQueryStages`.
That means `cometPlan.canonicalized.supportsColumnar` is `false` for some Comet
plan produced during post-stage-materialization re-planning. None of the suites
in this PR reproduce the bug yet — the bad canonical form does not appear in
the initial physical plan — but the tools are useful regardless.
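The invariant implied by that stack trace can be sketched as a node-level check (illustrative only, not the actual suite code; `checkCanonicalForm` is a hypothetical helper):

```scala
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// Sketch of the property the stack trace points at: if a plan reports
// columnar support, its canonical form must too, otherwise the assertion
// in ColumnarToRowExec's constructor fires when AQE re-canonicalizes
// during createQueryStages.
def checkCanonicalForm(p: SparkPlan): Unit = {
  if (p.supportsColumnar) {
    // Direct check on the plan's canonical form.
    assert(p.canonicalized.supportsColumnar,
      s"canonical form of ${p.nodeName} lost columnar support")
    // Wrapped check mirroring the #3949 stack: canonicalization rebuilds
    // ColumnarToRowExec via copy() on canonicalized children, which
    // re-runs the constructor assertion on the canonical child.
    ColumnarToRowExec(p).canonicalized
  }
}
```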
## What changes are included in this PR?
- **`FuzzFallback` utility**
(`spark/src/main/scala/org/apache/comet/FuzzFallback.scala`) — seeded,
node-identity-keyed decision cache. Calling `shouldVetoShuffle(s)` or
`shouldVetoExec(op)` twice for the same node returns the same answer, so
`getSupportLevel` and `createExec` stay consistent.
- **Injection points**:
- `CometShuffleExchangeExec.nativeShuffleSupported` and
`columnarShuffleSupported`
- The generic operator branch of `CometExecRule.convertNode`
- **Four new `testing`-category configs, all default-off**:
- `spark.comet.fuzz.fallback.enabled`
- `spark.comet.fuzz.fallback.seed`
- `spark.comet.fuzz.fallback.shuffleVetoProbability` (default 0.5)
- `spark.comet.fuzz.fallback.execVetoProbability` (default 0.0)
- **`CometFuzzFallbackSuite`** — all 129 TPC-DS queries (v1.4 + v2.7) × N
seeds, plan-only (no data required; uses `TPCDSBase` schemas).
- **`CometFuzzDppSuite`** — small partitioned fact/dim data, actually
executes DPP-flavored queries across SMJ/BHJ/coalesce variants with AQE on.
- **`CometCanonicalizationSuite`** — targeted node-level canonicalization
checks for scan/filter/project/aggregate/BHJ/DPP shapes, both directly
(`p.canonicalized.supportsColumnar`) and wrapped
(`ColumnarToRowExec(p).canonicalized` — mirrors the #3949 stack exactly).
- **`CometCanonicalizationTpcdsSuite`** — same checks across every TPC-DS
query plan.
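The decision-cache idea behind `FuzzFallback` can be sketched in plain Scala (a minimal, hypothetical sketch; `SeededVetoCache` and its members are illustrative names, not the PR's actual code):

```scala
import java.util.IdentityHashMap
import scala.util.Random

// Seeded, node-identity-keyed decision cache. The first query for a given
// node draws a random veto decision from the seeded RNG; every later query
// for the *same object* returns the cached answer, so separate callers
// (e.g. a support check and an exec builder) always agree.
class SeededVetoCache(seed: Long, vetoProbability: Double) {
  private val rng = new Random(seed)
  // Keyed by reference identity rather than equals/hashCode, so two
  // structurally equal plan nodes can still receive independent decisions.
  private val decisions = new IdentityHashMap[AnyRef, Boolean]()

  def shouldVeto(node: AnyRef): Boolean = synchronized {
    if (decisions.containsKey(node)) {
      decisions.get(node)
    } else {
      val veto = rng.nextDouble() < vetoProbability
      decisions.put(node, veto)
      veto
    }
  }
}
```

Because decisions are drawn from a seeded RNG in query order, the same seed replays the same fallback pattern across runs, which is what makes a fuzzed failure reproducible.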
## How are these changes tested?
All four suites run green locally (Spark 4.0):
- `CometFuzzFallbackSuite`: 129 tests × 2 seeds pass in ~20s.
- `CometFuzzDppSuite`: 900 iterations (3 variants × 3 queries × 100 seeds)
pass in ~2 min.
- `CometCanonicalizationSuite`: 6 tests, ~10s.
- `CometCanonicalizationTpcdsSuite`: 129 TPC-DS queries, ~8s.
All new configs default to off; no existing behavior changes when the fuzz
harness is disabled.
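For manual stress runs, the harness can be enabled like any other Comet config (illustrative fragment; the config keys are the ones added in this PR, the seed and probability values are arbitrary examples):

```scala
import org.apache.spark.sql.SparkSession

// Example only: enable the fuzz harness for an ad-hoc session.
val spark = SparkSession.builder()
  .config("spark.comet.fuzz.fallback.enabled", "true")
  .config("spark.comet.fuzz.fallback.seed", "42")
  .config("spark.comet.fuzz.fallback.shuffleVetoProbability", "0.5")
  .config("spark.comet.fuzz.fallback.execVetoProbability", "0.1")
  .getOrCreate()
```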
Draft because the primary goal — reproducing #3949 — is not met by any of
these suites. Leaving it in draft until either (a) one of the suites catches
the bug after further tuning, (b) a real failing plan arrives from awslabs so
we can tighten the canonicalization tests, or (c) we agree to merge the
infrastructure as-is as a regression guard even though it doesn't demonstrate
the repro today.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]