andygrove opened a new pull request, #4185:
URL: https://github.com/apache/datafusion-comet/pull/4185

   ## Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and 
enhancements and this helps us generate change logs for our releases. You can 
link an issue to this PR using the GitHub syntax. For example `Closes #123` 
indicates that this PR will close issue #123.
   -->
   
   N/A. Audit-driven test coverage; no behavior change.
   
   **Stacked on #4183.** This PR's branch is based on 
`feat/legacy-time-parser-policy-tests`, so until #4183 merges the diff here 
will include those commits as well.
   
   ## Rationale for this change
   
   `spark.sql.optimizer.nestedSchemaPruning.enabled` (default `true`) is the 
catalyst-level switch that lets columnar readers fetch only the leaves of a 
nested column. Comet propagates the flag into Hadoop conf via 
`CometParquetFileFormat.populateConf` and otherwise inherits Spark's 
already-pruned `requiredSchema`, but Comet's own test tree had no end-to-end 
coverage. Spark's own `ParquetSchemaPruningSuite` is patched in 
`dev/diffs/*.diff` to recognize Comet scans, but that only validates 
correctness when CI runs Spark tests, and doesn't lock in plan-level 
expectations from inside Comet.
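
   To make the flag's effect concrete, here is an illustrative sketch (not code from this PR; the path and column names are hypothetical): with pruning enabled, projecting a single nested leaf should shrink the scan's `requiredSchema` to just that leaf.

   ```scala
   // Hypothetical illustration of nested schema pruning semantics.
   spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

   // Projecting one leaf of a deeply nested struct...
   val df = spark.read.parquet("/tmp/contacts").select("employer.company.name")

   // ...should leave the scan with only that leaf in requiredSchema,
   // i.e. a catalog string like
   //   struct<employer:struct<company:struct<name:string>>>
   // rather than the full `employer` struct.
   ```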
   
   A SQL-file test cannot prove pruning happened: it only checks results, and 
pruned-vs-unpruned reads usually return the same rows. Plan inspection is the 
only way to catch a regression, so this audit uses Scala tests, mirroring 
Spark's `checkScanSchemata` pattern.
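
   The plan-inspection pattern can be sketched roughly as below. This is an assumption-laden sketch, not the suite's actual code: the helper name `assertScanSchema`, the import paths, and the assertion messages are all hypothetical, and `CometNativeScanExec` would be matched analogously to `CometScanExec`.

   ```scala
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.comet.CometScanExec  // package path assumed

   // Sketch of a checkScanSchemata-style assertion: walk the executed
   // plan, pull requiredSchema off every Comet scan node, and compare
   // its catalog string against the expected pruned schema.
   def assertScanSchema(df: DataFrame, expected: String): Unit = {
     val schemas = df.queryExecution.executedPlan.collect {
       case scan: CometScanExec => scan.requiredSchema
     }
     assert(schemas.nonEmpty, "expected at least one Comet scan in the plan")
     schemas.foreach { s =>
       assert(s.catalogString == expected,
         s"pruned schema mismatch: expected $expected, got ${s.catalogString}")
     }
   }
   ```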
   
   ## What changes are included in this PR?
   
   - New suite: 
`spark/src/test/scala/org/apache/comet/parquet/CometNestedSchemaPruningSuite.scala`.
 Each scenario runs across `SCAN_NATIVE_DATAFUSION` and 
`SCAN_NATIVE_ICEBERG_COMPAT` under V1 Parquet. A small helper walks the 
executed plan, collects `requiredSchema` from any 
`CometScanExec`/`CometNativeScanExec`, and asserts it matches an expected 
catalog-string schema; results are then compared against Spark via 
`checkSparkAnswer`. Scenarios:
     - top-level struct field
     - field inside array of struct
     - field inside map value
     - doubly-nested struct field
     - projection plus filter on a nested field
     - null at an intermediate struct level
   - Plain Parquet V2 is excluded from the matrix because Comet's V2 scan rule 
only covers CSV and Iceberg, so Parquet V2 stays as plain `BatchScanExec` and 
there's no Comet scan to inspect. Documented in the suite's class comment and 
the audit notes.
   - Appends a second entry to 
`docs/source/contributor-guide/spark_configs_support.md` with the full audit 
notes for `nestedSchemaPruning.enabled`: source semantics, current Comet 
status, test layout, and findings.
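
   A single scenario in the matrix might look roughly like the following. All names here are assumptions based on the description above: the `CometConf` constants, `withSQLConf`, `assertScanSchema`, and `checkSparkAnswer` stand in for whatever the suite actually uses.

   ```scala
   // Hypothetical shape of one matrix scenario: run the same projection
   // under both native scan implementations, checking the plan-level
   // pruned schema and then the results against Spark each time.
   Seq(CometConf.SCAN_NATIVE_DATAFUSION, CometConf.SCAN_NATIVE_ICEBERG_COMPAT)
     .foreach { scanImpl =>
       withSQLConf(
         CometConf.COMET_NATIVE_SCAN_IMPL.key -> scanImpl,
         "spark.sql.optimizer.nestedSchemaPruning.enabled" -> "true") {
         val df = spark.read.parquet(path).select("contact.name")
         assertScanSchema(df, "struct<contact:struct<name:string>>") // plan check
         checkSparkAnswer(df)                                        // result check
       }
     }
   ```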
   
   This PR was scaffolded with the project's `audit-comet-expression` workflow 
extended to a config-level audit, plus the `superpowers:brainstorming` and 
`superpowers:using-git-worktrees` skills.
   
   ## How are these changes tested?
   
   - `./mvnw test 
-Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" 
-Dtest=none` -- 12/12 pass on Spark 3.5.8 (default).
   - `./mvnw test -Pspark-3.4 
-Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" 
-Dtest=none` -- 12/12 pass.
   - `./mvnw test -Pspark-4.0 
-Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" 
-Dtest=none` -- 12/12 pass.
   
   No Comet bugs were uncovered by the audit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
