andygrove opened a new pull request, #4090:
URL: https://github.com/apache/datafusion-comet/pull/4090

   ## Which issue does this PR close?
   
   Closes #4089.
   
   ## Rationale for this change
   
   When the `native_datafusion` scan reads a Parquet column whose physical 
decimal type has higher precision than the requested read schema, or a 
different scale, the existing schema adapter falls through to Spark's `Cast` 
expression. `Cast` silently rescales or truncates, producing wrong values. 
Spark's vectorized reader rejects this mismatch with 
`SchemaColumnConvertNotSupportedException`, and `native_iceberg_compat` already 
does the same via `TypeUtil.checkParquetType`. The native scan should match 
that behavior.
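   A minimal illustration (not Comet code; the helper names are hypothetical) 
of why falling through to a cast is lossy: Decimal128 stores an unscaled 
`i128`, so rescaling a `Decimal(10,2)` value to `Decimal(5,0)` both drops the 
fractional digits and overflows the target precision.

```rust
// Hypothetical helpers, for illustration only: Decimal128 is an unscaled i128
// plus (precision, scale) metadata.
fn rescale(unscaled: i128, from_scale: u32, to_scale: u32) -> i128 {
    if to_scale >= from_scale {
        unscaled * 10i128.pow(to_scale - from_scale)
    } else {
        unscaled / 10i128.pow(from_scale - to_scale) // silently drops digits
    }
}

/// A value fits precision p when its unscaled magnitude has at most p digits.
fn fits_precision(unscaled: i128, precision: u32) -> bool {
    unscaled.abs() < 10i128.pow(precision)
}
```

For example, `12345678.90` stored as `Decimal(10,2)` has unscaled value 
`1_234_567_890`; `rescale(1_234_567_890, 2, 0)` yields `12_345_678`, which has 
lost the `.90` and no longer fits in 5 digits of precision.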
   
   ## What changes are included in this PR?
   
   `native/core/src/parquet/schema_adapter.rs`: in `replace_with_spark_cast`, 
add a guard before the existing branches that returns a `DataFusionError::Plan` 
when both `physical_type` and `target_type` are `Decimal128` and either the 
target precision is smaller than the source precision or the scales differ. The 
rule mirrors Spark's `TypeUtil.isDecimalTypeMatched` (Spark 3.x path).
   
   The check is intentionally narrow: it fires only for decimal-to-decimal 
mismatches, leaving every other type path unchanged. A prior, now-closed 
attempt at broad schema validation (#3311) broke unrelated tests; this narrow 
check does not.
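   The guard can be sketched in isolation as follows (hedged: the simplified 
`DataType` enum stands in for Arrow's, the `String` error for 
`DataFusionError::Plan`, and `check_decimal_compat` is a name invented for 
this sketch; the real check lives inside `replace_with_spark_cast`).

```rust
// Simplified stand-in for Arrow's DataType.
#[derive(Debug)]
enum DataType {
    Decimal128(u8, i8), // (precision, scale), Arrow's convention
    Int64,
}

/// Mirrors Spark's TypeUtil.isDecimalTypeMatched (Spark 3.x path): a
/// Decimal128 -> Decimal128 read is rejected when the target precision is
/// smaller than the source precision or the scales differ. Every other type
/// combination falls through to the existing cast branches.
fn check_decimal_compat(physical_type: &DataType, target_type: &DataType) -> Result<(), String> {
    if let (DataType::Decimal128(src_p, src_s), DataType::Decimal128(dst_p, dst_s)) =
        (physical_type, target_type)
    {
        if dst_p < src_p || dst_s != src_s {
            return Err(format!(
                "Cannot read Parquet decimal({src_p},{src_s}) as decimal({dst_p},{dst_s})"
            ));
        }
    }
    Ok(())
}
```

Under this rule the #4089 case (`Decimal(10,2)` read as `Decimal(5,0)`) is 
rejected, while widening the precision at the same scale still passes through.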
   
   Spark 4.0's decimal type widening (allowing strictly-larger target 
precision/scale per SPARK-37147) is not implemented here. For #4089's specific 
case (`Decimal(10,2)` read as `Decimal(5,0)`) Spark 4.0 also rejects, so 
behavior is correct on all supported versions. Implementing 4.0's wider 
widening is a follow-up.
   
   ## How are these changes tested?
   
   Added a focused test to `ParquetReadSuite`: `native_datafusion rejects 
incompatible decimal precision/scale`. It writes `Decimal(10, 2)` data, reads 
it under `Decimal(5, 0)`, forces `spark.comet.scan.impl=native_datafusion` and 
`spark.sql.sources.useV1SourceList=parquet`, and asserts that `collect()` 
raises `SparkException`. Verified against `ParquetReadV1Suite` (44 tests, all 
pass; 1 pre-existing test ignored). The behavior is also covered by the 
per-impl matrix added in #4087; once that PR merges, a follow-up will flip its 
`decimal(10,2) read as decimal(5,0): native_datafusion` assertion from 
"succeeds" to "throws".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

