andygrove opened a new issue, #4088:
URL: https://github.com/apache/datafusion-comet/issues/4088

   ## Description
   
   When the `native_datafusion` scan reads a Parquet column whose physical type 
is `BINARY` (STRING) under a requested read schema of `INT`, it silently 
reinterprets the BINARY bytes as raw INT32 bytes and returns garbage values. 
Spark's vectorized reader throws on this mismatch on all supported versions, so 
this is a correctness gap (returns wrong answers without an error) rather than 
a strict-mode parity gap.
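   
   To make "reinterprets the BINARY bytes as raw INT32 bytes" concrete, here is 
   a minimal, self-contained sketch. It is an illustration only and not Comet's 
   code path: the object `ReinterpretDemo` and the helper 
   `firstFourBytesAsInt32` are hypothetical, and this issue does not state 
   which bytes the scan actually reads. It only shows that string bytes read 
   as a little-endian 32-bit integer (the byte order Parquet's PLAIN encoding 
   uses for INT32) produce a number unrelated to the text.
   
   ```scala
   import java.nio.{ByteBuffer, ByteOrder}
   import java.nio.charset.StandardCharsets

   // Illustration only: NOT Comet's actual code path.
   object ReinterpretDemo {
     // Hypothetical helper: read the first four UTF-8 bytes of `s`
     // as one little-endian 32-bit integer.
     def firstFourBytesAsInt32(s: String): Int = {
       val bytes = s.getBytes(StandardCharsets.UTF_8)
       require(bytes.length >= 4, "need at least four bytes to form an INT32")
       ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt()
     }

     def main(args: Array[String]): Unit = {
       // "abcd" is 0x61 0x62 0x63 0x64; little-endian gives 0x64636261.
       println(firstFourBytesAsInt32("abcd")) // prints 1684234849
     }
   }
   ```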
   
   ## Reproduction
   
   ```scala
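   // Assumes a Spark SQL test suite: `withSQLConf` and `withTempPath` are
   // Spark test helpers, and `toDF` needs the session implicits in scope.
   withSQLConf(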
     CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
     SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
     withTempPath { dir =>
       val path = dir.getCanonicalPath
       Seq("a", "b", "c").toDF("c").write.parquet(path)
       val df = spark.read.schema("c int").parquet(path)
       df.show() // returns 3 rows of meaningless integers; should throw
     }
   }
   ```
   
   `native_iceberg_compat` correctly throws `SparkException` for this case 
(matches Spark).
   
   ## Affected versions
   
   All supported Spark profiles (3.4, 3.5, 4.0). Reproduced on Comet `main` 
while building #4087.
   
   ## Expected behavior
   
   The native reader should detect that the requested type (`INT`) is not 
byte-compatible with the physical column type (`BINARY`/UTF8) and raise an 
exception, matching Spark's `SchemaColumnConvertNotSupportedException`.
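   
   The check described above might look roughly like the following sketch. 
   Everything in it is an assumption made for illustration: the object 
   `SchemaCheckSketch`, its type names, the `SchemaMismatchException` class, 
   and the identity-only compatibility table are hypothetical, and they are 
   neither Comet's native reader nor Spark's conversion matrix. The point is 
   only the shape of the fix: validate the (physical type, requested type) 
   pair before decoding and fail loudly when it is not allowed.
   
   ```scala
   // Sketch under stated assumptions; all names here are hypothetical.
   object SchemaCheckSketch {
     sealed trait PhysicalType
     case object Int32 extends PhysicalType
     case object Int64 extends PhysicalType
     case object Binary extends PhysicalType

     sealed trait RequestedType
     case object IntType extends RequestedType
     case object LongType extends RequestedType
     case object StringType extends RequestedType

     // Stand-in for an error such as SchemaColumnConvertNotSupportedException.
     final class SchemaMismatchException(
         column: String,
         physical: PhysicalType,
         requested: RequestedType)
       extends RuntimeException(
         s"Column '$column': cannot read Parquet $physical as $requested")

     // Illustrative table: only identity mappings are allowed here.
     // A real reader would also list the widenings it supports.
     private val allowed = Set[(PhysicalType, RequestedType)](
       (Int32, IntType),
       (Int64, LongType),
       (Binary, StringType))

     // Throws instead of letting the decoder reinterpret the bytes.
     def checkReadable(
         column: String,
         physical: PhysicalType,
         requested: RequestedType): Unit =
       if (!allowed.contains((physical, requested)))
         throw new SchemaMismatchException(column, physical, requested)
   }
   ```
   
   With this sketch, `checkReadable("c", Binary, StringType)` returns normally, 
   while `checkReadable("c", Binary, IntType)` throws, which is the outcome 
   the reproduction above should have.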
   
   ## Test coverage
   
   Documented in `ParquetSchemaMismatchSuite` (added in #4087) under the test 
name `string read as int: native_datafusion`. The test currently asserts the 
buggy behavior, so the fix for this issue must also update that assertion and 
the matrix in the file header.
   
   ## Parent issue
   
   Split from #3720.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

