andygrove opened a new pull request, #4185:
URL: https://github.com/apache/datafusion-comet/pull/4185
## Which issue does this PR close?
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases. You can
link an issue to this PR using the GitHub syntax. For example `Closes #123`
indicates that this PR will close issue #123.
-->
N/A. Audit-driven test coverage; no behavior change.
**Stacked on #4183.** This PR's branch is based on
`feat/legacy-time-parser-policy-tests`, so until #4183 merges, the diff here
will also include those commits.
## Rationale for this change
`spark.sql.optimizer.nestedSchemaPruning.enabled` (default `true`) is the
Catalyst-level switch that lets columnar readers fetch only the leaf fields of
a nested column that a query actually references. Comet propagates the flag
into the Hadoop conf via `CometParquetFileFormat.populateConf` and otherwise
inherits Spark's already-pruned `requiredSchema`, but Comet's own test tree had
no end-to-end coverage. Spark's own `ParquetSchemaPruningSuite` is patched in
`dev/diffs/*.diff` to recognize Comet scans, but that only validates
correctness when CI runs the Spark tests, and it doesn't lock in plan-level
expectations from inside Comet.
A SQL-file test cannot prove pruning happened: it only checks results, and
pruned-vs-unpruned reads usually return the same rows. Plan inspection is the
only way to catch a regression, so this audit uses Scala tests, mirroring
Spark's `checkScanSchemata` pattern.
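The assertion pattern the suite mirrors looks roughly like the sketch below;
the helper name and the exact import paths are assumptions for illustration,
not the suite's actual code:

```scala
import org.apache.spark.sql.DataFrame
// Assumed import paths for Comet's scan operators.
import org.apache.spark.sql.comet.{CometNativeScanExec, CometScanExec}

// Collect the read schema of every Comet scan in the executed plan and
// compare its catalog string against the expected pruned schema.
def assertPrunedSchema(df: DataFrame, expected: String): Unit = {
  val schemata = df.queryExecution.executedPlan.collect {
    case scan: CometScanExec       => scan.requiredSchema.catalogString
    case scan: CometNativeScanExec => scan.requiredSchema.catalogString
  }
  assert(schemata.nonEmpty, "expected at least one Comet scan in the plan")
  schemata.foreach(s =>
    assert(s == expected, s"pruned schema was $s, expected $expected"))
}
```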
## What changes are included in this PR?
- New suite:
`spark/src/test/scala/org/apache/comet/parquet/CometNestedSchemaPruningSuite.scala`.
Each scenario runs across `SCAN_NATIVE_DATAFUSION` and
`SCAN_NATIVE_ICEBERG_COMPAT` under V1 Parquet. A small helper walks the
executed plan, collects `requiredSchema` from any
`CometScanExec`/`CometNativeScanExec`, and asserts it matches an expected
catalog-string schema; results are then compared against Spark via
`checkSparkAnswer` (a rough sketch of one such scenario appears after this
list). Scenarios:
- top-level struct field
- field inside array of struct
- field inside map value
- doubly-nested struct field
- projection plus filter on a nested field
- null at an intermediate struct level
- Plain Parquet V2 is excluded from the matrix because Comet's V2 scan rule
only covers CSV and Iceberg, so Parquet V2 stays as plain `BatchScanExec` and
there's no Comet scan to inspect. Documented in the suite's class comment and
the audit notes.
- Append a second entry to
`docs/source/contributor-guide/spark_configs_support.md` with the full audit
notes for `nestedSchemaPruning.enabled`: source semantics, current Comet
status, test layout, and findings.
This PR was scaffolded with the project's `audit-comet-expression` workflow
extended to a config-level audit, plus the `superpowers:brainstorming` and
`superpowers:using-git-worktrees` skills.
## How are these changes tested?
- `./mvnw test -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none` -- 12/12 pass on Spark 3.5.8 (default).
- `./mvnw test -Pspark-3.4 -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none` -- 12/12 pass.
- `./mvnw test -Pspark-4.0 -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none` -- 12/12 pass.
No Comet bugs were uncovered by the audit.