andygrove opened a new pull request, #4091:
URL: https://github.com/apache/datafusion-comet/pull/4091

   ## Which issue does this PR close?
   
   Closes #4088.
   
   ## Rationale for this change
   
   When the `native_datafusion` scan reads a Parquet `BINARY` (UTF8) column 
under a numeric read schema, the existing schema adapter creates a Spark `Cast` 
with `is_adapting_schema=true`. In that mode `Cast` delegates to DataFusion's 
cast, which parses the bytes (returning null on non-numeric strings, or in some 
paths reinterpreting the raw bytes). Spark's vectorized reader rejects this 
kind of mismatch with `SchemaColumnConvertNotSupportedException` on every 
supported version, and `native_iceberg_compat` already does the same via 
`TypeUtil.checkParquetType`. The native scan should match.
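
   To make the two failure modes above concrete, here is a minimal standalone sketch in plain Rust. It is not Comet or DataFusion code: `parse_or_null` and `reinterpret_le` are hypothetical helpers that only illustrate what "parse to null" and "reinterpret the raw bytes" mean for a string value read as a 32-bit integer. In both cases the read succeeds silently instead of raising an error.

```rust
/// Hypothetical helper: a parsing cast that yields null (None) for
/// strings that are not valid integers, instead of failing the read.
fn parse_or_null(s: &str) -> Option<i32> {
    s.parse::<i32>().ok()
}

/// Hypothetical helper: treat the first four bytes of the value as a
/// little-endian i32, ignoring the fact that they encode text.
fn reinterpret_le(bytes: &[u8]) -> Option<i32> {
    if bytes.len() < 4 {
        return None;
    }
    Some(i32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}
```

With these helpers, `parse_or_null("abc")` returns `None`, while `reinterpret_le(b"abcd")` returns an integer unrelated to the text. Neither outcome tells the user that the read schema does not match the file.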
   
   ## What changes are included in this PR?
   
   `native/core/src/parquet/schema_adapter.rs`: in `replace_with_spark_cast`, 
add a guard before the existing branches that returns `DataFusionError::Plan` 
when the source type is `Utf8`, `LargeUtf8`, `Binary`, or `LargeBinary` and the 
target type is any integer (`Int8`/`Int16`/`Int32`/`Int64`/`UInt*`) or 
floating-point type (`Float32`/`Float64`). The rule mirrors 
`TypeUtil.checkParquetType`'s `BINARY` case (lines 208-221), which only allows 
reading BINARY as `StringType`, `BinaryType`, or a binary-encoded decimal.
   
   The check is intentionally narrow: it only fires for string/binary -> 
numeric mismatches and leaves every other type path unchanged.
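
   The rule described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the code in `schema_adapter.rs`: the `DataType` enum is a hypothetical stand-in for the relevant subset of Arrow's type enum, `PlanError` stands in for `DataFusionError::Plan`, and `check_string_to_numeric` is an assumed name for the guard.

```rust
/// Hypothetical stand-in for the subset of Arrow data types that the
/// guard cares about; the real code uses Arrow's own type enum.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Binary,
    LargeBinary,
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Float32,
    Float64,
    Decimal128(u8, i8),
}

/// Hypothetical stand-in for a planning error such as `DataFusionError::Plan`.
#[derive(Debug)]
struct PlanError(String);

fn is_string_or_binary(t: DataType) -> bool {
    matches!(
        t,
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Binary | DataType::LargeBinary
    )
}

fn is_integer_or_float(t: DataType) -> bool {
    matches!(
        t,
        DataType::Int8
            | DataType::Int16
            | DataType::Int32
            | DataType::Int64
            | DataType::UInt8
            | DataType::UInt16
            | DataType::UInt32
            | DataType::UInt64
            | DataType::Float32
            | DataType::Float64
    )
}

/// Reject string/binary -> integer/float reads; every other
/// combination is passed through to the existing cast logic untouched.
fn check_string_to_numeric(source: DataType, target: DataType) -> Result<(), PlanError> {
    if is_string_or_binary(source) && is_integer_or_float(target) {
        return Err(PlanError(format!(
            "cannot read Parquet column of type {:?} as {:?}",
            source, target
        )));
    }
    Ok(())
}
```

Under these assumptions, `check_string_to_numeric(DataType::Utf8, DataType::Int32)` returns an error, while a binary-to-decimal or string-to-string read returns `Ok(())` and falls through to the existing branches.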
   
   ## How are these changes tested?
   
   Added a focused test to `ParquetReadSuite`: `native_datafusion rejects 
string read as numeric`. It writes string data, reads it under `c int`, forces 
`spark.comet.scan.impl=native_datafusion` and 
`spark.sql.sources.useV1SourceList=parquet`, and asserts that `collect()` 
raises `SparkException`. Verified against `ParquetReadV1Suite` (44 succeeded, 
no regressions; 1 pre-existing test ignored).
   
   The behavior is also covered by the per-impl matrix added in #4087 (`string 
read as int: native_datafusion`), whose assertion will need flipping from 
"succeeds with garbage" to "throws" once that PR merges.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

