andygrove opened a new pull request, #4117:
URL: https://github.com/apache/datafusion-comet/pull/4117
## Which issue does this PR close?
Partially addresses #3720.
## Rationale for this change
Spark's vectorized Parquet reader signals incompatible column reads (e.g.
reading a STRING column as INT, scale-narrowing decimal, scalar read as ARRAY)
with `SchemaColumnConvertNotSupportedException`. `FileScanRDD` (3.x) and
`FileDataSourceV2` (4.0) then wrap that into a typed `SparkException`
(`_LEGACY_ERROR_TEMP_2063` on 3.x,
`FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH` on 4.0).
`native_datafusion` raised the same incompatible-read cases (after
#4090 and #4091), but as an opaque `CometNativeException`, so several Spark SQL
tests that assert the typed exception chain or message format had to be ignored
under `IgnoreCometNativeDataFusion("...3720")`. This PR aligns
native_datafusion's error path with `native_iceberg_compat`'s, so the same
tests pass without changing user-visible semantics for the cases that already
errored.
## What changes are included in this PR?
**Native side**
- New `SparkError::ParquetSchemaConvert { file_path, column, physical_type,
spark_type }` variant in `native/common/src/error.rs`, wired into
`error_type_name`, `params_as_json`, and `exception_class`. Flows through the
existing `DataFusionError::External` → `CometQueryExecutionException` JSON
pipeline.
- `native/core/src/parquet/schema_adapter.rs`: the two
`DataFusionError::Plan(...)` returns added by #4090 and #4091 now return the
new variant. Added a planning-time guard for scalar/complex mismatch (e.g.
`Timestamp` read as `Array<Timestamp>`) — covers SPARK-45604.
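A self-contained, illustrative sketch of the two native pieces above: the new error variant with its JSON encoding, and the planning-time scalar/complex guard. Names follow this description, but the stand-in `DataType` enum, the helper names, and the JSON layout are assumptions; the real code in `native/common/src/error.rs` and `native/core/src/parquet/schema_adapter.rs` works on Arrow's `DataType` and the crate's existing error machinery.

```rust
// Illustrative stand-in for Arrow's DataType, to keep the sketch self-contained.
#[derive(Debug)]
enum DataType {
    Timestamp,
    List(Box<DataType>),
}

// Sketch of the new error variant; the real SparkError has many more variants.
#[derive(Debug)]
enum SparkError {
    ParquetSchemaConvert {
        file_path: String,
        column: String,
        physical_type: String,
        spark_type: String,
    },
}

impl SparkError {
    fn error_type_name(&self) -> &'static str {
        "ParquetSchemaConvert"
    }

    // Flat JSON object the JVM shim decodes to rebuild a
    // SchemaColumnConvertNotSupportedException (field names are assumptions).
    fn params_as_json(&self) -> String {
        let SparkError::ParquetSchemaConvert {
            file_path,
            column,
            physical_type,
            spark_type,
        } = self;
        format!(
            "{{\"file_path\":\"{file_path}\",\"column\":\"{column}\",\
             \"physical_type\":\"{physical_type}\",\"spark_type\":\"{spark_type}\"}}"
        )
    }
}

fn is_complex(dt: &DataType) -> bool {
    matches!(dt, DataType::List(_))
}

// Planning-time guard: reject a read where the file stores a scalar but Spark
// expects a complex type (e.g. Timestamp read as Array<Timestamp>), or vice
// versa. The file path is left empty here for the reason noted below.
fn check_convert(
    column: &str,
    file_type: &DataType,
    spark_type: &DataType,
) -> Result<(), SparkError> {
    if is_complex(file_type) != is_complex(spark_type) {
        return Err(SparkError::ParquetSchemaConvert {
            file_path: String::new(),
            column: column.to_string(),
            physical_type: format!("{file_type:?}"),
            spark_type: format!("{spark_type:?}"),
        });
    }
    Ok(())
}
```

In the real code path the `Err` is wrapped in `DataFusionError::External` so it reaches the JVM through the existing `CometQueryExecutionException` JSON pipeline.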
**JVM shims** (3.4 / 3.5 / 4.0 versions of `ShimSparkErrorConverter.scala`)
- New `case "ParquetSchemaConvert"` translates the JSON-encoded native error
to a `SchemaColumnConvertNotSupportedException`, then wraps it via the
version-appropriate `QueryExecutionErrors`:
- 3.4 / 3.5: `unsupportedSchemaColumnConvertError` →
`_LEGACY_ERROR_TEMP_2063`
- 4.0: `parquetColumnDataTypeMismatchError` →
`FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH`
- The file path is currently passed as an empty string because DataFusion's
`PhysicalExprAdapterFactory::create` doesn't expose it. The wrapped message
still satisfies the `errMsg.contains("Parquet column cannot be converted in
file")` assertions.
**Spark SQL diffs**
Tests previously ignored under `IgnoreCometNativeDataFusion("...3720")` are
unignored where they now pass:
- `dev/diffs/3.5.8.diff`: SPARK-35640 binary as timestamp, SPARK-45604 ntz
to array<ntz> (verified locally end-to-end on Spark v3.5.8 with
`ENABLE_COMET=true ENABLE_COMET_ONHEAP=true`).
- `dev/diffs/3.4.3.diff`: same two tests (parallel structure with 3.5; same
shim code path).
- `dev/diffs/4.0.2.diff`: SPARK-45604 ntz to array<ntz> only. SPARK-35640
binary as timestamp on 4.0 uses `checkErrorMatchPVals` with strict parameter
format (`column="[_1]"`, `actualType="BINARY"`, `expectedType="timestamp"`);
the shim currently passes Arrow type names (`Utf8`, `Timestamp(µs, "UTC")`)
without brackets and would need an Arrow-to-Parquet/Spark type-name translation
step. Left ignored as a follow-up.
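The follow-up translation step could be a small name-mapping table in the shim. A rough Rust sketch of the idea (the real shim is Scala, and the mappings and function names here are illustrative assumptions, not the complete table):

```rust
// Hypothetical mapping from Arrow type names reported by the native side to
// the Parquet physical type names that 4.0's checkErrorMatchPVals expects
// (e.g. actualType="BINARY"). Incomplete and illustrative only.
fn arrow_to_parquet_physical(arrow_name: &str) -> &str {
    match arrow_name {
        "Utf8" | "Binary" => "BINARY",
        "Int32" => "INT32",
        "Int64" => "INT64",
        s if s.starts_with("Timestamp") => "INT64",
        other => other,
    }
}

// The expected parameter format also wraps the column name in brackets,
// e.g. column="[_1]".
fn bracketed_column(column: &str) -> String {
    format!("[{column}]")
}
```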
## How are these changes tested?
End-to-end against `apache/spark` v3.5.8 and v4.0.2 with the Comet jar and
`ENABLE_COMET=true ENABLE_COMET_ONHEAP=true`:
| Test | 3.5.8 | 4.0.2 |
|---|---|---|
| `ParquetIOSuite` SPARK-35640 binary as timestamp | pass | fail (param format, see above) |
| `ParquetSchemaSuite` SPARK-45604 ntz to array<ntz> | pass | pass |
| `ParquetSchemaSuite` schema mismatch failure error message vectorized | fail (test extracts the file path from the message and re-reads it; needs file-path plumbing through `PhysicalExprAdapterFactory::create`, separate follow-up) | not exercised |
Existing Comet regression tests added by #4090 and #4091 (`ParquetReadSuite`
"native_datafusion rejects string read as non-string/binary type" and
"native_datafusion rejects incompatible decimal precision/scale") continue to
pass — they assert `assertThrows[SparkException]`, which the new wrapped
`SparkException` satisfies.
`cargo clippy --all-targets --workspace -- -D warnings` clean.
## Out of scope (follow-ups)
- 3.5.8 vectorized schema mismatch test (requires file-path plumbing).
- 4.0.2 binary-as-timestamp test (requires Arrow-to-Parquet/Spark type-name
translation in shim).
- 3.4.3 end-to-end run (parallel to 3.5.8, not separately verified).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]