andygrove opened a new pull request, #4190: URL: https://github.com/apache/datafusion-comet/pull/4190
## Which issue does this PR close?

Closes #4136.

## Rationale for this change

Spark 4.1 (SPARK-53535) added `LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING`. With the new default (`false`), Spark's vectorized reader appends a "marker" leaf field to the Parquet read schema so it can distinguish a null parent row from a non-null parent whose requested fields are all missing in the file, then truncates the marker out of the `ColumnarBatch`.

Comet's `parquet_convert_struct_to_struct` (in `native/core/src/parquet/parquet_support.rs`) unconditionally collapsed a non-overlapping struct to a fully-null array via `NullBuffer::new_null(array.len())`, conflating the two cases. With the new Spark 4.1 default, this caused 5 `ParquetIOSuite` tests to fail because non-null parent rows came back as `Row(null)` instead of `Row(Row(null, null))`.

## What changes are included in this PR?

Adds `return_null_struct_if_all_fields_missing` to `SparkParquetOptions` (default `true` for backwards compatibility) and plumbs it through both scan paths:

- **native_datafusion**: read in `CometNativeScan.convert` from the Spark conf, sent via a new `NativeScanCommon` proto field, and threaded through `init_datasource_exec`.
- **native_iceberg_compat**: read in `CometParquetFileFormat`, passed through the `NativeBatchReader` constructor and a new `initRecordBatchReader` JNI parameter.

For both paths, the JVM defaults to `"true"` on Spark < 4.1 (the legacy hardcoded behavior) and `"false"` on Spark 4.1+ (matching Spark's registered conf default); explicit user settings are honored.

When the flag is `false` and no requested fields overlap with the file's struct, `parquet_convert_struct_to_struct` now preserves `array.nulls()` so the parent struct's nullness from the file is propagated.

The 5 `IgnoreComet` annotations for the SPARK-53535 / SPARK-54220 tests are removed from `dev/diffs/4.1.1.diff`.

## How are these changes tested?
- New test `issue #4136: struct with all requested fields missing in file` in `CometNativeReaderSuite`, parameterized over both `native_datafusion` and `native_iceberg_compat`. The test writes a Parquet file with `struct<_1: struct<_1: int, _2: string>>` containing a mix of null and non-null parent rows, reads with a schema requesting `struct<_1: struct<_3: int, _4: long>>` (all inner fields missing in the file), and verifies the result matches Spark's vectorized reader for both `legacy=true` and `legacy=false` modes. Both implementations fail before this change and pass after. The full `CometNativeReaderSuite` (100 tests) passes.
- The 5 `ParquetIOSuite` tests un-ignored in `dev/diffs/4.1.1.diff` will run in the Spark SQL CI workflow.
