andygrove opened a new issue, #4088:
URL: https://github.com/apache/datafusion-comet/issues/4088
## Description
When the `native_datafusion` scan reads a Parquet column whose physical type
is `BINARY` (logical type `STRING`/UTF8) with a requested read schema of
`INT`, it silently reinterprets the BINARY bytes as raw INT32 bytes and
returns garbage values. Spark's vectorized reader throws on this mismatch in
all supported versions, so this is a correctness gap (wrong answers with no
error) rather than a strict-mode parity gap.
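For intuition, a hedged illustration of why reinterpreting UTF-8 bytes as
little-endian INT32 yields unrelated integers (the actual garbage depends on
the buffer layout Parquet hands back, e.g. plain vs. dictionary encoding and
length prefixes, so this is illustrative only):
```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical: four UTF-8 bytes reinterpreted as one little-endian INT32.
val raw = "abcd".getBytes("UTF-8") // 0x61 0x62 0x63 0x64
val asInt = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getInt()
println(asInt) // 1684234849 (0x64636261), unrelated to the original strings
```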
## Reproduction
```scala
withSQLConf(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Write a STRING column, then read it back with an INT schema.
    Seq("a", "b", "c").toDF("c").write.parquet(path)
    val df = spark.read.schema("c int").parquet(path)
    df.show() // returns 3 rows of meaningless integers; should throw
  }
}
```
`native_iceberg_compat` correctly throws a `SparkException` in this case,
matching Spark's behavior.
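A minimal sketch of verifying that, assuming the same test harness as the
repro above plus ScalaTest's `intercept`:
```scala
withSQLConf(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_ICEBERG_COMPAT,
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq("a", "b", "c").toDF("c").write.parquet(path)
    // The mismatch is detected and surfaced as a SparkException,
    // matching Spark's own vectorized reader.
    intercept[org.apache.spark.SparkException] {
      spark.read.schema("c int").parquet(path).collect()
    }
  }
}
```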
## Affected versions
All supported Spark profiles (3.4, 3.5, 4.0). Reproduced on Comet `main`
while building #4087.
## Expected behavior
The native reader should detect that the requested type (`INT`) is not
byte-compatible with the physical column type (`BINARY`/UTF8) and raise an
exception, matching Spark's `SchemaColumnConvertNotSupportedException`.
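A sketch of the assertion a fix should be able to satisfy, assuming ScalaTest's
`intercept` and the repro's `path`; the error is typically wrapped by
task-failure layers, so the sketch walks the cause chain:
```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException

val e = intercept[SparkException] {
  spark.read.schema("c int").parquet(path).collect()
}
// Walk the cause chain rather than inspecting e.getCause directly,
// since the wrapping differs across Spark versions.
def causes(t: Throwable): List[Throwable] =
  if (t == null) Nil else t :: causes(t.getCause)
assert(causes(e).exists(_.isInstanceOf[SchemaColumnConvertNotSupportedException]))
```
Whether a native-side fix surfaces Spark's exact exception class or an
equivalent `SparkException` message is a detail for the fix itself.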
## Test coverage
Documented in `ParquetSchemaMismatchSuite` (added in #4087) under the test
name `string read as int: native_datafusion`. The test currently asserts the
buggy behavior, so a fix will need to update both that assertion and the
matrix in the file header.
## Parent issue
Split from #3720.