andygrove commented on issue #3720:
URL: https://github.com/apache/datafusion-comet/issues/3720#issuecomment-4320313542

   Added a permanent suite documenting the per-case, per-scan-impl, 
per-Spark-version behavior in #4087 (`ParquetSchemaMismatchSuite`). The suite 
encodes current behavior rather than fixing it, so the issue stays open. 
Highlights from running the matrix locally on Spark 3.4 / 3.5 / 4.0:
   
   ```
   Case                                  Spark 3.4  Spark 3.5  Spark 4.0  Comet native_datafusion                Comet native_iceberg_compat
   1. BINARY -> TIMESTAMP                throw      throw      throw      throw                                  throw
   2. INT32 -> INT64                     throw      throw      OK         OK (widened values)                    throw on 3.x / OK on 4.0
   3. INT96 LTZ -> TIMESTAMP_NTZ         throw      throw      throw      OK (silent, possible wall-clock diff)  throw on 3.x / OK on 4.0
   4. Decimal(10,2) -> Decimal(5,0)      throw      throw      throw      OK (reads, values unverified)          throw
   5. INT32 -> INT64 w/ rowgroup filter  throw      throw      OK         OK (1 row, no overflow)                throw on 3.x / OK on 4.0
   6. STRING -> INT                      throw      throw      throw      OK (garbage values)                    throw
   7. TIMESTAMP_NTZ -> ARRAY<...>        throw      throw      throw      throw                                  throw
   C1. INT8 -> INT32 (control)           OK         OK         OK         OK (widened values)                    OK (widened values)
   C2. FLOAT -> DOUBLE (control)         OK         OK         OK         OK (widened values)                    throw on 3.x / OK on 4.0
   ```
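   The overflow risk that case 5 guards against can be sketched outside Spark. This is an illustrative plain-Python model (not Comet or DataFusion code): when INT32 data is read as INT64 and filtered against a constant beyond INT32 range, the comparison must happen in 64 bits; a hypothetical reader that casts the constant down to 32 bits would wrap it negative and match every row.

   ```python
   INT32_MAX = 2**31 - 1

   def to_int32(x: int) -> int:
       """Wrap an integer into signed 32-bit range (models a naive downcast)."""
       return (x + 2**31) % 2**32 - 2**31

   rows = [1, 100, INT32_MAX]     # values physically stored as INT32
   threshold = INT32_MAX + 1      # filter constant beyond INT32 range

   # Correct 64-bit comparison: no value stored as INT32 can reach the threshold.
   print([v for v in rows if v >= threshold])            # []

   # Buggy path: downcasting the constant wraps it to -2**31, so every
   # row suddenly "matches" -- the kind of overflow the suite guards against.
   print([v for v in rows if v >= to_int32(threshold)])  # [1, 100, 2147483647]
   ```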
   
   Observations relevant to the original issue list:
   
   - **Binary read as timestamp:** both scan impls now throw `SparkException` 
(case 1). `native_datafusion` raises `CometNativeException` ("column types must 
match schema types"), `native_iceberg_compat` raises 
`SchemaColumnConvertNotSupportedException`. Behavior aligns with Spark on all 
versions.
   - **INT32 read as INT64:** `native_datafusion` widens silently on all Spark 
versions (matches Spark 4.0, diverges from Spark 3.x). `native_iceberg_compat` 
throws on 3.x and accepts on 4.0 because `COMET_SCHEMA_EVOLUTION_ENABLED` 
defaults to true on 4.0.
   - **TimestampLTZ read as TimestampNTZ:** `native_datafusion` cannot detect 
the mismatch (INT96 carries no timezone metadata) and silently reads, possibly 
with a wrong wall-clock value. `native_iceberg_compat` throws on 3.x, but on 
4.0 the `isSpark40Plus` guard in `TypeUtil.checkParquetType` bypasses the INT96 
check and the read silently succeeds.
   - **Decimal precision/scale (10,2 -> 5,0):** `native_datafusion` succeeds 
without value validation (returned values were not verified by the test); 
`native_iceberg_compat` throws on all versions.
   - **String read as INT:** `native_datafusion` silently reinterprets the 
BINARY bytes of each string as INT32 garbage values without throwing on any 
Spark version. This is a real correctness gap (returns wrong answers without an 
error). `native_iceberg_compat` throws.
   - **TIMESTAMP_NTZ read as ARRAY<TIMESTAMP_NTZ>:** both scan impls throw on all versions (matches Spark).
   - **Row group skipping regression guard (case 5):** `native_datafusion` 
returns the correct single row when reading INT32 as INT64 with a filter 
constant beyond INT32 max, so no overflow occurs.
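   The STRING-as-INT failure mode in case 6 can be reproduced without Spark. A minimal Python sketch (illustrative only, not Comet code) of what reinterpreting a string's raw UTF-8 bytes as a little-endian INT32 produces; the 4-byte restriction and little-endian order are assumptions of the illustration, not verified Comet behavior:

   ```python
   import struct

   def bytes_as_int32(s: str) -> int:
       """Reinterpret a 4-byte UTF-8 payload as a little-endian INT32,
       the way a reader that skips type validation would."""
       raw = s.encode("utf-8")
       assert len(raw) == 4, "illustration limited to 4-byte payloads"
       return struct.unpack("<i", raw)[0]

   # b"1234" is 0x31 0x32 0x33 0x34, which decodes to 875770417 --
   # a garbage value, not the integer 1234.
   print(bytes_as_int32("1234"))  # 875770417
   ```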
   
   Two cases worth follow-up issues if not already tracked:
   
   1. `native_datafusion` reading STRING as INT silently returns garbage. The 
vectorized native reader does not validate that the requested type is 
byte-compatible with the physical column.
   2. `native_datafusion` reading higher-precision decimal as lower-precision 
decimal succeeds without value validation; whether the values are correct under 
truncation/rounding rules has not been verified.
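   For follow-up 2, a plain-Python sketch of what a narrowing cast from Decimal(10,2) to Decimal(5,0) must handle. The rounding mode (half-up) and the overflow-to-null policy here are assumptions for illustration, not verified Comet semantics:

   ```python
   from decimal import Decimal, ROUND_HALF_UP
   from typing import Optional

   def narrow_to_5_0(v: Decimal) -> Optional[Decimal]:
       # Drop the fractional digits (rounding rule is an assumption here).
       rounded = v.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
       # Decimal(5,0) holds at most 5 integral digits.
       if abs(rounded) > Decimal("99999"):
           return None  # out of range: must overflow/null, not return garbage
       return rounded

   print(narrow_to_5_0(Decimal("12345.67")))   # 12346
   print(narrow_to_5_0(Decimal("123456.78")))  # None (does not fit Decimal(5,0))
   ```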
   
   The suite will surface either of these via test failure if the underlying 
behavior changes (intentional fix or accidental regression). When a fix lands, 
update the affected test(s) and the matrix in the same PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to