andygrove opened a new pull request, #4087:
URL: https://github.com/apache/datafusion-comet/pull/4087
## Which issue does this PR close?
Part of #3720 (does not close it; this PR encodes current behavior rather
than fixing divergences).
## Rationale for this change
Issue #3720 lists several Parquet schema-mismatch cases where Comet diverges
from Spark, but no permanent test captures the per-case, per-scan-impl,
per-Spark-version behavior. The divergences are tracked only in the issue text
and a couple of ad-hoc tests in `ParquetReadSuite`.
Adding a focused suite gives us:
- a regression guard for every case in #3720,
- a behavior matrix that shows at a glance how `native_datafusion` and
`native_iceberg_compat` differ from Spark on each Spark version, and
- a concrete reference that future fixes can target (and the test and matrix to
update when a fix lands).
## What changes are included in this PR?
A new test suite at
`spark/src/test/scala/org/apache/comet/parquet/ParquetSchemaMismatchSuite.scala`
with 18 tests covering 9 cases (7 from #3720 plus 2 harmless control
widenings) under the two Comet scan implementations (`native_datafusion`,
`native_iceberg_compat`).
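As an illustration of the shape of these tests, here is a minimal sketch (not the suite's exact code). It assumes the suite mixes in Comet's usual test helpers (`withSQLConf`, `withTempPath`, a `spark` session) and that `spark.comet.scan.impl` is the config key that selects the scan implementation:
```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Register each case once per scan implementation (illustrative structure only).
Seq("native_datafusion", "native_iceberg_compat").foreach { scanImpl =>
  test(s"case 2: INT32 -> INT64 ($scanImpl)") {
    withSQLConf("spark.comet.scan.impl" -> scanImpl) {
      withTempPath { dir =>
        // Write the file with an INT32 column, then read it back as INT64.
        spark.range(10).selectExpr("CAST(id AS INT) AS c").write.parquet(dir.toString)
        val widened = StructType(Seq(StructField("c", LongType)))
        val df = spark.read.schema(widened).parquet(dir.toString)
        // Assert the behavior recorded in the matrix below (throw vs. widened values).
      }
    }
  }
}
```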
Each test asserts Comet's actual current behavior. Spark's reference
behavior is documented in per-case comments and in a behavior matrix at the top
of the file. Where Comet's behavior depends on the Spark version (Spark 4.0
enables `COMET_SCHEMA_EVOLUTION_ENABLED` by default and
`TypeUtil.checkParquetType` has an `isSpark40Plus` guard for INT96), the
relevant tests use `if (CometSparkSessionExtensions.isSpark40Plus) ...` to keep
assertions accurate.
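For the version-dependent cases, the assertions branch in roughly this shape (illustrative only; `readMismatched()` and `expectedWidenedRows` are placeholders, not names from the suite):
```scala
import org.apache.comet.CometSparkSessionExtensions

if (CometSparkSessionExtensions.isSpark40Plus) {
  // Spark 4.0: COMET_SCHEMA_EVOLUTION_ENABLED defaults to true, so the
  // value-preserving widening read succeeds.
  checkAnswer(readMismatched(), expectedWidenedRows)
} else {
  // Spark 3.x: the strict TypeUtil.checkParquetType path rejects the read.
  intercept[Exception] {
    readMismatched().collect()
  }
}
```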
Behavior matrix (excerpt):
```
Case                                  Spark 3.4  Spark 3.5  Spark 4.0  Comet native_datafusion                 Comet native_iceberg_compat
1. BINARY -> TIMESTAMP                throw      throw      throw      throw                                   throw
2. INT32 -> INT64                     throw      throw      OK         OK (widened values)                     throw on 3.x / OK on 4.0
3. INT96 LTZ -> TIMESTAMP_NTZ         throw      throw      throw      OK (silent, possible wall-clock diff)   throw on 3.x / OK on 4.0
4. Decimal(10,2) -> Decimal(5,0)      throw      throw      throw      OK (reads, values unverified)           throw
5. INT32 -> INT64 w/ rowgroup filter  throw      throw      OK         OK (1 row, no overflow)                 throw on 3.x / OK on 4.0
6. STRING -> INT                      throw      throw      throw      OK (garbage values)                     throw
7. TIMESTAMP_NTZ -> ARRAY<...>        throw      throw      throw      throw                                   throw
C1. INT8 -> INT32                     OK         OK         OK         OK (widened values)                     OK (widened values)
C2. FLOAT -> DOUBLE                   OK         OK         OK         OK (widened values)                     throw on 3.x / OK on 4.0
```
Notable findings the suite captures:
- `native_iceberg_compat` is consistently strict on Spark 3.x (uses Comet's
`TypeUtil.checkParquetType`); on Spark 4.0 it relaxes for value-preserving
widenings because `COMET_SCHEMA_EVOLUTION_ENABLED` defaults to true.
- `native_datafusion` rejects some structural mismatches (binary as
timestamp, timestamp_ntz as array) but silently accepts others (decimal
precision narrowing, string read as int producing garbage values). The
string-as-int case is a real correctness gap worth a separate follow-up
(see the sketch after this list).
- The Case 5 row-group skipping regression test passes on
`native_datafusion`: filters on widened columns do not overflow.
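As a concrete illustration of the string-as-int gap, the mismatched read looks roughly like this (hypothetical sketch; the suite's actual helpers and data may differ):
```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

withTempPath { dir =>
  import testImplicits._ // toDF from the shared test utilities (assumption)
  // Write a STRING column, then force an INT read schema on the same file.
  Seq("1", "2", "abc").toDF("c").write.parquet(dir.toString)
  val intSchema = StructType(Seq(StructField("c", IntegerType)))
  val df = spark.read.schema(intSchema).parquet(dir.toString)
  // Spark throws on every version; native_datafusion currently returns rows
  // whose values do not correspond to the original strings.
  df.collect()
}
```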
## How are these changes tested?
The new suite is the test. Verified locally under Spark 3.4, 3.5, and 4.0
profiles (all 18 tests pass under each).