andygrove opened a new pull request, #4087:
URL: https://github.com/apache/datafusion-comet/pull/4087

   ## Which issue does this PR close?
   
   Part of #3720 (does not close it; this PR encodes current behavior rather 
than fixing divergences).
   
   ## Rationale for this change
   
   Issue #3720 lists several Parquet schema-mismatch cases where Comet diverges 
from Spark, but no permanent test captures the per-case, per-scan-impl, 
per-Spark-version behavior. The divergences are tracked only in the issue text 
and a couple of ad-hoc tests in `ParquetReadSuite`.
   
   Adding a focused suite gives us:
   
   - a regression guard for every case in #3720,
   - a behavior matrix that shows at a glance how `native_datafusion` and 
`native_iceberg_compat` differ from Spark on each Spark version, and
   - a concrete reference future fixes can target (and the test+matrix to 
update when a fix lands).
   
   ## What changes are included in this PR?
   
   A new test suite at 
`spark/src/test/scala/org/apache/comet/parquet/ParquetSchemaMismatchSuite.scala`
 with 18 tests covering 9 cases (7 from #3720 plus 2 harmless control 
widenings) under the two Comet scan implementations (`native_datafusion`, 
`native_iceberg_compat`).
   
   Each test asserts Comet's actual current behavior. Spark's reference 
behavior is documented in per-case comments and in a behavior matrix at the top 
of the file. Where Comet's behavior depends on the Spark version (Spark 4.0 
enables `COMET_SCHEMA_EVOLUTION_ENABLED` by default and 
`TypeUtil.checkParquetType` has an `isSpark40Plus` guard for INT96), the 
relevant tests use `if (CometSparkSessionExtensions.isSpark40Plus) ...` to keep 
assertions accurate.
   
   Behavior matrix (excerpt):
   
   ```
   Case                                  Spark 3.4  3.5    4.0    Comet native_datafusion                Comet native_iceberg_compat
   1. BINARY -> TIMESTAMP                throw      throw  throw  throw                                  throw
   2. INT32 -> INT64                     throw      throw  OK     OK (widened values)                    throw on 3.x / OK on 4.0
   3. INT96 LTZ -> TIMESTAMP_NTZ         throw      throw  throw  OK (silent, possible wall-clock diff)  throw on 3.x / OK on 4.0
   4. Decimal(10,2) -> Decimal(5,0)      throw      throw  throw  OK (reads, values unverified)          throw
   5. INT32 -> INT64 w/ rowgroup filter  throw      throw  OK     OK (1 row, no overflow)                throw on 3.x / OK on 4.0
   6. STRING -> INT                      throw      throw  throw  OK (garbage values)                    throw
   7. TIMESTAMP_NTZ -> ARRAY<...>        throw      throw  throw  throw                                  throw
   C1. INT8 -> INT32                     OK         OK     OK     OK (widened values)                    OK (widened values)
   C2. FLOAT -> DOUBLE                   OK         OK     OK     OK (widened values)                    throw on 3.x / OK on 4.0
   ```
   
   Notable findings the suite captures:
   
   - `native_iceberg_compat` is consistently strict on Spark 3.x (uses Comet's 
`TypeUtil.checkParquetType`); on Spark 4.0 it relaxes for value-preserving 
widenings because `COMET_SCHEMA_EVOLUTION_ENABLED` defaults to true.
   - `native_datafusion` rejects some structural mismatches (binary as 
timestamp, timestamp_ntz as array) but silently accepts others (decimal 
precision narrowing, string read as int producing garbage values). The 
string-as-int case is a real correctness gap worth a separate follow-up.
   - The row-group-skipping regression test for Case 5 passes on 
`native_datafusion`: filters on widened columns do not overflow.
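
   To make the string-as-int hazard concrete, here is a minimal stand-alone illustration. This is an assumption about the failure mode (raw bytes reinterpreted under the wrong type), not Comet's actual decode path: treating the UTF-8 bytes of a string as a little-endian INT32 yields a number unrelated to the original text.

   ```scala
   import java.nio.{ByteBuffer, ByteOrder}

   // Illustrative only: why reading a STRING (BINARY) Parquet column through
   // an INT read schema can produce "garbage" values. Simplified model, not
   // Comet's actual read path.
   object StringAsIntSketch {
     def reinterpretAsInt(s: String): Int =
       ByteBuffer.wrap(s.getBytes("UTF-8")).order(ByteOrder.LITTLE_ENDIAN).getInt

     def main(args: Array[String]): Unit = {
       // 'a'=0x61, 'b'=0x62, 'c'=0x63, 'd'=0x64 -> 0x64636261 = 1684234849
       println(reinterpretAsInt("abcd"))
     }
   }
   ```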
   
   ## How are these changes tested?
   
   The new suite is the test. Verified locally under Spark 3.4, 3.5, and 4.0 
profiles (all 18 tests pass under each).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

