andygrove opened a new pull request, #4117:
URL: https://github.com/apache/datafusion-comet/pull/4117

   ## Which issue does this PR close?
   
   Partially addresses #3720.
   
   ## Rationale for this change
   
   Spark's vectorized Parquet reader signals incompatible column reads (e.g. 
reading a STRING column as INT, scale-narrowing decimal, scalar read as ARRAY) 
with `SchemaColumnConvertNotSupportedException`. `FileScanRDD` (3.x) and 
`FileDataSourceV2` (4.0) then wrap that into a typed `SparkException` 
(`_LEGACY_ERROR_TEMP_2063` on 3.x, 
`FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH` on 4.0).
   
   `native_datafusion` raised errors for the same incompatible-read cases (after 
#4090 and #4091), but as opaque `CometNativeException`s, so several Spark SQL 
tests that assert the typed exception chain or message format had to be ignored 
under `IgnoreCometNativeDataFusion("...3720")`. This PR aligns 
`native_datafusion`'s error path with `native_iceberg_compat`'s, so those 
tests pass without changing user-visible semantics for the cases that already 
errored.
   
   ## What changes are included in this PR?
   
   **Native side**
   
   - New `SparkError::ParquetSchemaConvert { file_path, column, physical_type, 
spark_type }` variant in `native/common/src/error.rs`, wired into 
`error_type_name`, `params_as_json`, and `exception_class`. Flows through the 
existing `DataFusionError::External` → `CometQueryExecutionException` JSON 
pipeline.
   - `native/core/src/parquet/schema_adapter.rs`: the two 
`DataFusionError::Plan(...)` returns added by #4090 and #4091 now return the 
new variant. Added a planning-time guard for scalar/complex mismatch (e.g. 
`Timestamp` read as `Array<Timestamp>`) — covers SPARK-45604.
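
   A minimal std-only sketch of what the new variant and its JSON plumbing might look like. Names mirror the PR description; the real code in `native/common/src/error.rs` presumably uses serde rather than hand-rolled formatting, and the full `SparkError` enum has many more variants:

   ```rust
   // Hypothetical sketch, not the actual Comet code: a SparkError variant
   // carrying the fields the JVM shim needs to build
   // SchemaColumnConvertNotSupportedException.
   #[derive(Debug)]
   enum SparkError {
       ParquetSchemaConvert {
           file_path: String,
           column: String,
           physical_type: String,
           spark_type: String,
       },
   }

   impl SparkError {
       // Stable discriminator the shim matches on (`case "ParquetSchemaConvert"`).
       fn error_type_name(&self) -> &'static str {
           match self {
               SparkError::ParquetSchemaConvert { .. } => "ParquetSchemaConvert",
           }
       }

       // JSON payload consumed on the JVM side; hand-rolled here with std only.
       fn params_as_json(&self) -> String {
           match self {
               SparkError::ParquetSchemaConvert {
                   file_path,
                   column,
                   physical_type,
                   spark_type,
               } => format!(
                   "{{\"file_path\":{:?},\"column\":{:?},\"physical_type\":{:?},\"spark_type\":{:?}}}",
                   file_path, column, physical_type, spark_type
               ),
           }
       }
   }

   fn main() {
       let err = SparkError::ParquetSchemaConvert {
           file_path: String::new(), // currently empty, see shim note below
           column: "_1".to_string(),
           physical_type: "BINARY".to_string(),
           spark_type: "timestamp".to_string(),
       };
       println!("{}", err.error_type_name());
       println!("{}", err.params_as_json());
   }
   ```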
   
   **JVM shims** (3.4 / 3.5 / 4.0 versions of `ShimSparkErrorConverter.scala`)
   
   - New `case "ParquetSchemaConvert"` translates the JSON-encoded native error 
to a `SchemaColumnConvertNotSupportedException`, then wraps it via the 
version-appropriate `QueryExecutionErrors`:
     - 3.4 / 3.5: `unsupportedSchemaColumnConvertError` → 
`_LEGACY_ERROR_TEMP_2063`
     - 4.0: `parquetColumnDataTypeMismatchError` → 
`FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH`
   - The file path is currently passed as an empty string, since DataFusion's 
`PhysicalExprAdapterFactory::create` doesn't expose it. The wrapped message 
still satisfies the `errMsg.contains("Parquet column cannot be converted in 
file")` assertions.
   
   **Spark SQL diffs**
   
   Tests previously ignored under `IgnoreCometNativeDataFusion("...3720")` are 
unignored where they now pass:
   
   - `dev/diffs/3.5.8.diff`: SPARK-35640 binary as timestamp, SPARK-45604 ntz 
to array<ntz> (verified locally end-to-end on Spark v3.5.8 with 
`ENABLE_COMET=true ENABLE_COMET_ONHEAP=true`).
   - `dev/diffs/3.4.3.diff`: same two tests (parallel structure with 3.5; same 
shim code path).
   - `dev/diffs/4.0.2.diff`: SPARK-45604 ntz to array<ntz> only. SPARK-35640 
binary as timestamp on 4.0 uses `checkErrorMatchPVals` with strict parameter 
format (`column="[_1]"`, `actualType="BINARY"`, `expectedType="timestamp"`); 
the shim currently passes Arrow type names (`Utf8`, `Timestamp(µs, "UTC")`) 
without brackets and would need an Arrow-to-Parquet/Spark type-name translation 
step. Left ignored as a follow-up.
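
   The follow-up translation could be sketched roughly like this. Illustrative only, not part of this PR: a real implementation would match on Arrow `DataType` values rather than rendered strings, and would need a complete Arrow-to-Parquet physical-type mapping:

   ```rust
   // Hypothetical sketch of the missing type-name translation step.
   // Maps an Arrow-style rendered type name to the lowercase Spark type name
   // Spark 4.0's checkErrorMatchPVals expects (e.g. expectedType="timestamp").
   fn spark_type_name(arrow_name: &str) -> String {
       if arrow_name.starts_with("Timestamp") {
           "timestamp".to_string()
       } else {
           // Very partial fallback; many Arrow names need explicit mapping.
           arrow_name.to_ascii_lowercase()
       }
   }

   // 4.0 expects the column rendered with brackets, e.g. column="[_1]".
   fn bracketed_column(col: &str) -> String {
       format!("[{col}]")
   }

   fn main() {
       println!("{}", spark_type_name("Timestamp(µs, \"UTC\")"));
       println!("{}", bracketed_column("_1"));
   }
   ```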
   
   ## How are these changes tested?
   
   End-to-end against `apache/spark` v3.5.8 and v4.0.2 with the Comet jar and 
`ENABLE_COMET=true ENABLE_COMET_ONHEAP=true`:
   
   | Test | 3.5.8 | 4.0.2 |
   |---|---|---|
   | `ParquetIOSuite` SPARK-35640 binary as timestamp | pass | fail (param 
format, see above) |
   | `ParquetSchemaSuite` SPARK-45604 ntz to array<ntz> | pass | pass |
   | `ParquetSchemaSuite` schema mismatch failure error message vectorized | 
fail (test extracts file path from message and re-reads — needs file-path 
plumbing through `PhysicalExprAdapterFactory::create`, separate follow-up) | 
not exercised |
   
   Existing Comet regression tests added by #4090 and #4091 (`ParquetReadSuite` 
"native_datafusion rejects string read as non-string/binary type" and 
"native_datafusion rejects incompatible decimal precision/scale") continue to 
pass — they assert `assertThrows[SparkException]`, which the new wrapped 
`SparkException` satisfies.
   
   `cargo clippy --all-targets --workspace -- -D warnings` clean.
   
   ## Out of scope (follow-ups)
   
   - 3.5.8 vectorized schema mismatch test (requires file-path plumbing).
   - 4.0.2 binary-as-timestamp test (requires Arrow-to-Parquet/Spark type-name 
translation in shim).
   - 3.4.3 end-to-end run (parallel to 3.5.8, not separately verified).
