andygrove opened a new pull request, #4190:
URL: https://github.com/apache/datafusion-comet/pull/4190

   ## Which issue does this PR close?
   
   Closes #4136.
   
   ## Rationale for this change
   
   Spark 4.1 (SPARK-53535) added 
`LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING`. With the new default 
(`false`), Spark's vectorized reader appends a "marker" leaf field to the 
Parquet read schema so it can distinguish a null parent row from a non-null 
parent whose requested fields are all missing in the file; the marker is then 
truncated out of the resulting `ColumnarBatch`.
   
   Comet's `parquet_convert_struct_to_struct` (in 
`native/core/src/parquet/parquet_support.rs`) unconditionally collapsed a 
struct with no requested fields present in the file into a fully-null array via 
`NullBuffer::new_null(array.len())`, conflating the two cases. With the new 
Spark 4.1 default, this caused 5 `ParquetIOSuite` tests to fail because 
non-null parent rows came back as `Row(null)` instead of `Row(Row(null, null))`.
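   The distinction between the two behaviors can be sketched with a std-only 
model. The real code operates on arrow-rs `StructArray`/`NullBuffer`; here a 
parent struct column is reduced to a per-row validity mask, and all names are 
hypothetical:

   ```rust
   /// Parent struct column: `true` means the parent row is non-null in the file.
   #[derive(Debug, PartialEq)]
   struct StructCol {
       validity: Vec<bool>,
   }

   /// Legacy behavior: when no requested field exists in the file, collapse
   /// the parent to all-null, discarding the file's validity.
   fn convert_all_null(parent: &StructCol) -> StructCol {
       StructCol { validity: vec![false; parent.validity.len()] }
   }

   /// Fixed behavior: preserve the parent's validity from the file; a non-null
   /// parent surfaces as a struct whose requested children are all null.
   fn convert_preserve_nulls(parent: &StructCol) -> StructCol {
       StructCol { validity: parent.validity.clone() }
   }

   fn main() {
       // Row 0 is null in the file; row 1 is a non-null struct.
       let parent = StructCol { validity: vec![false, true] };
       // Legacy: both rows come back as Row(null).
       assert_eq!(convert_all_null(&parent).validity, vec![false, false]);
       // Fixed: row 1 comes back as Row(Row(null, null)).
       assert_eq!(convert_preserve_nulls(&parent).validity, vec![false, true]);
   }
   ```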
   
   ## What changes are included in this PR?
   
   Adds `return_null_struct_if_all_fields_missing` to `SparkParquetOptions` 
(defaulting to `true` for backward compatibility) and plumbs it through both 
scan paths:
   
   - **native_datafusion**: read in `CometNativeScan.convert` from the Spark 
conf, sent via a new `NativeScanCommon` proto field, threaded through 
`init_datasource_exec`.
   - **native_iceberg_compat**: read in `CometParquetFileFormat`, passed 
through the `NativeBatchReader` constructor and a new `initRecordBatchReader` 
JNI parameter.
   
   For both paths the JVM defaults to `"true"` on Spark < 4.1 (legacy hardcoded 
behavior) and `"false"` on Spark 4.1+ (matches Spark's registered conf 
default), with explicit user settings honored. When the flag is `false` and no 
requested fields overlap with the file's struct, 
`parquet_convert_struct_to_struct` now preserves `array.nulls()` so the parent 
struct's nullness from the file is propagated.
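   The version-dependent default described above can be sketched as follows. 
This is a hypothetical Rust rendering of logic that lives on the JVM side; 
`effective_flag` and its parameters are illustrative names, not the actual API:

   ```rust
   /// An explicit user setting always wins; otherwise Spark < 4.1 defaults to
   /// true (legacy hardcoded behavior) and Spark 4.1+ defaults to false.
   fn effective_flag(user_setting: Option<&str>, spark_version: (u32, u32)) -> bool {
       match user_setting {
           Some(v) => v.eq_ignore_ascii_case("true"),
           None => spark_version < (4, 1),
       }
   }

   fn main() {
       assert!(effective_flag(None, (3, 5)));         // Spark 3.5: legacy default
       assert!(!effective_flag(None, (4, 1)));        // Spark 4.1: new default
       assert!(effective_flag(Some("true"), (4, 2))); // explicit setting honored
   }
   ```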
   
   The 5 `IgnoreComet` annotations for the SPARK-53535 / SPARK-54220 tests are 
removed from `dev/diffs/4.1.1.diff`.
   
   ## How are these changes tested?
   
   - New test `issue #4136: struct with all requested fields missing in file` 
in `CometNativeReaderSuite`, parameterized over both `native_datafusion` and 
`native_iceberg_compat`. The test writes a Parquet file with `struct<_1: 
struct<_1: int, _2: string>>` containing a mix of null and non-null parent 
rows, reads with a schema requesting `struct<_1: struct<_3: int, _4: long>>` 
(all inner fields missing in the file), and verifies the result matches Spark's 
vectorized reader in both `legacy=true` and `legacy=false` modes. Both 
implementations fail before this change and pass after it. The full 
`CometNativeReaderSuite` (100 tests) passes.
   - The 5 `ParquetIOSuite` tests un-ignored in `dev/diffs/4.1.1.diff` will run 
in the Spark SQL CI workflow.

