andygrove commented on issue #3720:
URL: https://github.com/apache/datafusion-comet/issues/3720#issuecomment-4320313542
Added a permanent suite in #4087 (`ParquetSchemaMismatchSuite`) documenting the per-case, per-scan-implementation, per-Spark-version behavior. The suite encodes current behavior rather than fixing it, so the issue stays open.
Highlights from running the matrix locally on Spark 3.4 / 3.5 / 4.0:
| Case | Spark 3.4 | Spark 3.5 | Spark 4.0 | Comet `native_datafusion` | Comet `native_iceberg_compat` |
|------|-----------|-----------|-----------|---------------------------|-------------------------------|
| 1. BINARY -> TIMESTAMP | throw | throw | throw | throw | throw |
| 2. INT32 -> INT64 | throw | throw | OK | OK (widened values) | throw on 3.x / OK on 4.0 |
| 3. INT96 LTZ -> TIMESTAMP_NTZ | throw | throw | throw | OK (silent, possible wall-clock diff) | throw on 3.x / OK on 4.0 |
| 4. Decimal(10,2) -> Decimal(5,0) | throw | throw | throw | OK (reads, values unverified) | throw |
| 5. INT32 -> INT64 w/ rowgroup filter | throw | throw | OK | OK (1 row, no overflow) | throw on 3.x / OK on 4.0 |
| 6. STRING -> INT | throw | throw | throw | OK (garbage values) | throw |
| 7. TIMESTAMP_NTZ -> ARRAY<...> | throw | throw | throw | throw | throw |
| C1. INT8 -> INT32 (control) | OK | OK | OK | OK (widened values) | OK (widened values) |
| C2. FLOAT -> DOUBLE (control) | OK | OK | OK | OK (widened values) | throw on 3.x / OK on 4.0 |
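For reference, each matrix cell reduces to a read of the same shape. Below is a minimal sketch of how a case like case 2 (INT32 -> INT64) can be encoded; `test`, `withTempPath`, `intercept`, and `isSpark40Plus` are assumed Spark test-suite helpers, and the real `ParquetSchemaMismatchSuite` may structure this differently:

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch only: `test`, `withTempPath`, `intercept`, and `isSpark40Plus` are
// assumed to come from Spark's test infrastructure; this is not the suite's
// actual code.
test("case 2: INT32 read as INT64") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Write a file whose physical Parquet type is INT32.
    spark.range(10).selectExpr("CAST(id AS INT) AS c").write.parquet(path)
    // Read it back with a requested schema of INT64.
    val widened = StructType(Seq(StructField("c", LongType)))
    def read() = spark.read.schema(widened).parquet(path).collect()
    if (isSpark40Plus) {
      // Spark 4.0 widens INT32 to INT64 and returns the original values.
      assert(read().map(_.getLong(0)).sorted.toSeq == (0L until 10L))
    } else {
      // Spark 3.x rejects the type promotion.
      intercept[SparkException](read())
    }
  }
}
```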
Observations relevant to the original issue list:
- **Binary read as timestamp:** both scan impls now throw `SparkException`
(case 1). `native_datafusion` raises `CometNativeException` ("column types must
match schema types"), `native_iceberg_compat` raises
`SchemaColumnConvertNotSupportedException`. Behavior aligns with Spark on all
versions.
- **INT32 read as INT64:** `native_datafusion` widens silently on all Spark
versions (matches Spark 4.0, diverges from Spark 3.x). `native_iceberg_compat`
throws on 3.x and accepts on 4.0 because `COMET_SCHEMA_EVOLUTION_ENABLED`
defaults to true on 4.0.
- **TimestampLTZ read as TimestampNTZ:** `native_datafusion` cannot detect
the mismatch (INT96 carries no timezone metadata) and silently reads, possibly
with a wrong wall-clock value. `native_iceberg_compat` throws on 3.x, but on
4.0 the `isSpark40Plus` guard in `TypeUtil.checkParquetType` bypasses the INT96
check and the read silently succeeds.
- **Decimal precision/scale (10,2 -> 5,0):** `native_datafusion` reads successfully, but the test did not verify the returned values; `native_iceberg_compat` throws on all versions.
- **String read as INT:** `native_datafusion` silently reinterprets each string's BINARY bytes as INT32 and returns garbage values without throwing on any Spark version. This is a real correctness gap (wrong answers, no error); `native_iceberg_compat` throws. A standalone repro is sketched after this list.
- **TIMESTAMP_NTZ read as ARRAY<TIMESTAMP_NTZ>:** both scan impls throw on all versions (matches Spark).
- **Row group skipping regression guard (case 5):** `native_datafusion`
returns the correct single row when reading INT32 as INT64 with a filter
constant beyond INT32 max, so no overflow occurs.
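A standalone repro of the STRING-as-INT gap needs nothing beyond public Spark APIs. This is a sketch: the path and local session are placeholders, and enabling Comet with a particular scan implementation is environment-specific and omitted.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Placeholder session/path; Comet setup and scan-impl selection are omitted.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val path = "/tmp/string_as_int.parquet"
Seq("abc", "defg", "hi").toDF("c").write.mode("overwrite").parquet(path)

// Request INT32 for a column whose physical Parquet type is BINARY (UTF8).
// Vanilla Spark and native_iceberg_compat throw here; per the matrix above,
// native_datafusion reinterprets the bytes and returns garbage integers.
val asInt = StructType(Seq(StructField("c", IntegerType)))
spark.read.schema(asInt).parquet(path).show()
```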
Two cases worth follow-up issues if not already tracked:
1. `native_datafusion` reading STRING as INT silently returns garbage. The
vectorized native reader does not validate that the requested type is
byte-compatible with the physical column.
2. `native_datafusion` reading a higher-precision decimal as a lower-precision decimal succeeds without value validation; whether the values are correct under truncation/rounding rules has not been verified. A sketch of such a check follows this list.
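For item 2, the missing value check could look roughly like the following sketch, taking vanilla Spark's CAST semantics as the reference; the path and session are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

// Placeholder session/path; assumes Comet's native_datafusion scan is enabled.
val spark = SparkSession.builder().master("local[1]").getOrCreate()

val path = "/tmp/decimal_narrowing.parquet"
spark.sql("SELECT CAST(123.45 AS DECIMAL(10,2)) AS d")
  .write.mode("overwrite").parquet(path)

// Vanilla Spark throws on this read; under native_datafusion it succeeds,
// and the returned value should match Spark's own CAST before it is trusted.
val narrowed = StructType(Seq(StructField("d", DecimalType(5, 0))))
val got = spark.read.schema(narrowed).parquet(path).collect()
val expected =
  spark.sql("SELECT CAST(CAST(123.45 AS DECIMAL(10,2)) AS DECIMAL(5,0)) AS d").collect()
println(s"got=${got.toSeq}, expected=${expected.toSeq}")
```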
The suite will surface either of these as a test failure if the underlying behavior changes, whether through an intentional fix or an accidental regression. When a fix lands, update the affected test(s) and the matrix in the same PR.