mbutrovich opened a new pull request, #4216:
URL: https://github.com/apache/datafusion-comet/pull/4216
## Which issue does this PR close?
Closes #3434. Closes #4189.
## Rationale for this change
The `native_datafusion` scan was the only Comet read path that did not honor
Parquet field IDs. When `spark.sql.parquet.fieldId.read.enabled=true` and the
requested schema carried IDs, `CometScanRule` fell back to Spark's reader
(added in #3415). The legacy V1 reader and `native_iceberg_compat` already
handle field-ID matching, and the issue notes that this gap also blocks the
Delta path because Delta strips field-ID metadata from `requiredSchema` before
it reaches Comet.
This change resolves columns by `parquet.field.id` end-to-end on the
`native_datafusion` path, matching Spark's
`ParquetReadSupport.clipParquetGroupFields` precedence rules so behavior stays
consistent with the V1 reader.
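A minimal sketch of that precedence rule, using hypothetical `Field`/`resolve` names rather than the actual Comet or Spark types: a requested field that carries an ID matches a file field only by ID (no match means the column reads as NULL), while a requested field without an ID falls back to name matching.

```scala
// Hypothetical model of the clipParquetGroupFields-style precedence;
// names and types here are illustrative, not the real implementation.
case class Field(name: String, fieldId: Option[Int])

// Resolve a requested (logical) field against the file's fields:
// ID-bearing fields match only by ID; None means the column reads as NULL.
// Fields without an ID fall back to name matching.
def resolve(requested: Field, fileFields: Seq[Field]): Option[Field] =
  requested.fieldId match {
    case Some(id) => fileFields.find(_.fieldId.contains(id))
    case None     => fileFields.find(_.name == requested.name)
  }
```

Note that an ID-bearing field never falls back to name matching, even when a same-named file field exists; that is what keeps renames safe under field-ID reads.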
## What changes are included in this PR?
- **Drop the fallback gate** in `CometScanRule.transformV1Scan` that forced
Spark to handle reads with field IDs.
- **Propagate field IDs through the proto** by adding a `metadata` map on
`SparkStructField` and a parallel `field_metadata` on `StructInfo`, plus
`use_field_id`/`ignore_missing_field_id` flags on `NativeScanCommon`. The JVM
serde emits `parquet.field.id` under arrow-rs's `PARQUET_FIELD_ID_META_KEY` so
both sides agree on the key. A new
`CometParquetUtils.PARQUET_FIELD_ID_META_KEY` constant centralizes the string.
- **Match Spark's per-field precedence on the native side**: at the top
level, `SparkPhysicalExprAdapterFactory` extends its existing case-insensitive
remap with ID-first matching; at nested struct levels,
`parquet_convert_struct_to_struct` does the same. ID-bearing logical fields
match only by ID (if no file field carries that ID, the column resolves to
NULL); fields without an ID fall back to name matching. This mirrors
`clipParquetGroupFields`.
- **Zero cost in the common case**: the new flag is set only when both
`spark.sql.parquet.fieldId.read.enabled=true` and the requested schema actually
carries field IDs. When it is unset, every path takes the same code as before.
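To illustrate how the ID can travel as string metadata on a field, here is a self-contained sketch; the key value and the `SparkStructField` shape are assumptions for illustration (Comet centralizes the real key in `CometParquetUtils.PARQUET_FIELD_ID_META_KEY` so the JVM and native sides agree):

```scala
// Assumed metadata key following the arrow-rs convention; the actual
// constant lives in CometParquetUtils.PARQUET_FIELD_ID_META_KEY.
val ParquetFieldIdMetaKey = "PARQUET:field_id"

// Simplified stand-in for the proto's SparkStructField + metadata map.
case class SparkStructField(name: String, metadata: Map[String, String])

// JVM side: attach the field ID as string metadata before serde.
def withFieldId(name: String, id: Int): SparkStructField =
  SparkStructField(name, Map(ParquetFieldIdMetaKey -> id.toString))

// Native side (conceptually): read the ID back out, if present.
def fieldId(f: SparkStructField): Option[Int] =
  f.metadata.get(ParquetFieldIdMetaKey).map(_.toInt)
```

A field without the metadata entry simply yields no ID, which is what lets non-ID-bearing fields keep the existing name-based path.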
## How are these changes tested?
Two new tests in `ParquetReadSuite` exercise the path under
`SCAN_NATIVE_DATAFUSION` with `checkSparkAnswerAndOperator`:
- a primitive case where the read-schema names differ from the file's,
- a nested case (struct + array<struct> + map) with renames at every level.
Spark's `ParquetFieldIdIOSuite` and `ParquetFieldIdSchemaSuite` are already
unignored in `dev/diffs/{3.4.3,3.5.8,4.0.2,4.1.1}.diff` and currently pass via
the Spark fallback; with this change they exercise the native path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]