mbutrovich opened a new pull request, #4216:
URL: https://github.com/apache/datafusion-comet/pull/4216
## Which issue does this PR close?
Closes #3434. Closes #4189.
## Rationale for this change
The `native_datafusion` scan was the only Comet read path that did not honor
Parquet field IDs. When `spark.sql.parquet.fieldId.read.enabled=true` and the
requested schema carried IDs, `CometScanRule` fell back to Spark's reader
(added in #3415). The legacy V1 reader and `native_iceberg_compat` already
handle field-ID matching, and the issue notes that this gap also blocks the
Delta path because Delta strips field-ID metadata from `requiredSchema` before
it reaches Comet.
This change resolves columns by `parquet.field.id` end-to-end on the
`native_datafusion` path, matching Spark's
`ParquetReadSupport.clipParquetGroupFields` precedence rules so behavior stays
consistent with the V1 reader.
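A minimal sketch of that precedence rule, using hypothetical `Field`/`resolve` names rather than the actual Comet or Spark types: a requested field that carries an ID matches a file field only by ID (no match means the column reads as NULL), while a requested field without an ID falls back to name matching.

```scala
// Hypothetical model of the clipParquetGroupFields-style precedence;
// names and types here are illustrative, not the real implementation.
case class Field(name: String, fieldId: Option[Int])

// Resolve a requested (logical) field against the file's fields:
// ID-bearing fields match only by ID; None means the column reads as NULL.
// Fields without an ID fall back to name matching.
def resolve(requested: Field, fileFields: Seq[Field]): Option[Field] =
  requested.fieldId match {
    case Some(id) => fileFields.find(_.fieldId.contains(id))
    case None     => fileFields.find(_.name == requested.name)
  }
```

Note that an ID-bearing field never falls back to name matching, even when a same-named file field exists; that is what keeps renames safe under field-ID reads.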
## What changes are included in this PR?
- **Drop the fallback gate** in `CometScanRule.transformV1Scan` that forced
Spark to handle reads with field IDs.
- **Propagate field IDs through the proto** by adding a `metadata` map on
`SparkStructField` and a parallel `field_metadata` on `StructInfo`, plus
`use_field_id`/`ignore_missing_field_id` flags on `NativeScanCommon`. The JVM
serde emits `parquet.field.id` under arrow-rs's `PARQUET_FIELD_ID_META_KEY` so
both sides agree on the key. A new
`CometParquetUtils.PARQUET_FIELD_ID_META_KEY` constant centralizes the string.
- **Match Spark's per-field precedence on the native side**: at the top
level, `SparkPhysicalExprAdapterFactory` extends its existing case-insensitive
remap with ID-first matching; at nested struct levels,
`parquet_convert_struct_to_struct` does the same. ID-bearing logical fields
match only by ID (if no file field carries that ID, the column resolves to
NULL); fields without an ID fall back to name matching. This mirrors
`clipParquetGroupFields`.
- **Zero cost in the common case**: the new flag is set only when both
`spark.sql.parquet.fieldId.read.enabled=true` and the requested schema actually
carries field IDs. When it is unset, every path takes the same code as before.
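To illustrate how the ID can travel as string metadata on a field, here is a self-contained sketch; the key value and the `SparkStructField` shape are assumptions for illustration (Comet centralizes the real key in `CometParquetUtils.PARQUET_FIELD_ID_META_KEY` so the JVM and native sides agree):

```scala
// Assumed metadata key following the arrow-rs convention; the actual
// constant lives in CometParquetUtils.PARQUET_FIELD_ID_META_KEY.
val ParquetFieldIdMetaKey = "PARQUET:field_id"

// Simplified stand-in for the proto's SparkStructField + metadata map.
case class SparkStructField(name: String, metadata: Map[String, String])

// JVM side: attach the field ID as string metadata before serde.
def withFieldId(name: String, id: Int): SparkStructField =
  SparkStructField(name, Map(ParquetFieldIdMetaKey -> id.toString))

// Native side (conceptually): read the ID back out, if present.
def fieldId(f: SparkStructField): Option[Int] =
  f.metadata.get(ParquetFieldIdMetaKey).map(_.toInt)
```

A field without the metadata entry simply yields no ID, which is what lets non-ID-bearing fields keep the existing name-based path.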
## How are these changes tested?
Two new tests in `ParquetReadSuite` exercise the path under
`SCAN_NATIVE_DATAFUSION` with `checkSparkAnswerAndOperator`:
- a primitive case where the read-schema names differ from the file's,
- a nested case (struct + array<struct> + map) with renames at every level.
Spark's `ParquetFieldIdIOSuite` and `ParquetFieldIdSchemaSuite` are already
unignored in `dev/diffs/{3.4.3,3.5.8,4.0.2,4.1.1}.diff` and currently pass via
the Spark fallback; with this change they exercise the native path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]