andygrove opened a new pull request, #4084:
URL: https://github.com/apache/datafusion-comet/pull/4084

   ## Which issue does this PR close?
   
   Closes #2209.
   
   ## Rationale for this change
   
   Spark 4.0's `PushVariantIntoScan` optimizer rewrites a `VariantType` column 
into a `StructType` whose fields each carry `__VARIANT_METADATA_KEY` metadata, 
then pushes `variant_get` paths down as ordinary struct field accesses. By the 
time `CometScanRule` runs, the `requiredSchema` looks like a normal struct of 
primitives, so Comet scans natively but does not honor the on-disk Parquet 
variant shredding layout. The result is silent data corruption: typed paths 
read back as nulls (or, with some shapes, a hard `FAILED_READ_FILE`).
   
   This is data correctness, so we need to fall back to Spark for these reads 
rather than continuing to ignore the suites.
   
   ## What changes are included in this PR?
   
   - `CometTypeShim` (Spark 4.0): add `isVariantStruct` that delegates to 
Spark's `VariantMetadata.isVariantStruct`, which checks for the 
`__VARIANT_METADATA_KEY` marker on every field.
   - `CometTypeShim` (Spark 3.x): stub returning false; variant shredding does 
not exist pre-4.0.
   - `CometScanTypeChecker.isTypeSupported`: add a `case s: StructType if 
isVariantStruct(s) => false` arm with a fallback reason, so both 
`auto`/`native_datafusion` and `native_iceberg_compat` scan paths fall back to 
Spark on shredded Variant reads.
   - `dev/diffs/4.0.1.diff`: stop ignoring `VariantShreddingSuite` and 
`ParquetVariantShreddingSuite`.
   
   ## How are these changes tested?
   
   Ran `sql/testOnly org.apache.spark.sql.VariantShreddingSuite 
org.apache.spark.sql.execution.datasources.parquet.ParquetVariantShreddingSuite`
 against patched Spark v4.0.1 with `ENABLE_COMET=true ENABLE_COMET_ONHEAP=true`:
   
   - `COMET_PARQUET_SCAN_IMPL=auto`: 13/13 pass (was 5/13).
   - `COMET_PARQUET_SCAN_IMPL=native_iceberg_compat`: 13/13 pass (was 5/13).
   
   The Spark SQL test workflows already cover both scan impls on every PR, so 
the unignored suites give us ongoing protection against regressions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to