andygrove opened a new issue, #4199: URL: https://github.com/apache/datafusion-comet/issues/4199
## Description

Spark writes `NullType` columns to parquet as `BOOLEAN` physical type with an `Unknown` logical type annotation (comment in [`ParquetSchemaConverter.scala`](https://github.com/apache/spark/blob/v4.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L877): _"Selected primitive type here doesn't have significance"_). parquet-rs only accepts `LogicalType::Unknown` paired with `PhysicalType::INT32` and rejects any other physical type with `Cannot annotate Unknown from BOOLEAN for field '…'` ([parquet-57.2.0/src/schema/types.rs:401, :423](https://docs.rs/parquet/57.2.0/parquet/schema/types/index.html)).

Result: any attempt to read a Spark-written parquet file that contains a `NullType` field fails in Comet with:

```
org.apache.comet.CometNativeException: Parquet error: Cannot annotate Unknown from BOOLEAN for field '_1'
```

The SPARK-54220 test in `ParquetIOSuite` (`SPARK-54220: vectorized reader: missing all struct fields, struct with NullType only`) is the concrete reproducer. It was unignored as part of PR #4190 / issue #4136 but crashes on the parquet read path before the new fix in `parquet_convert_struct_to_struct` is reached.

## Reproducer

See `issue #4136: struct with only NullType fields in file (SPARK-54220)` in `CometNativeReaderSuite`. The failure manifests for both `native_datafusion` and `native_iceberg_compat`.

## Suspected fix

Either:

1. Upstream parquet-rs to accept `(Unknown, BOOLEAN)` (and arguably any physical type, since Spark's comment makes clear the physical type is a don't-care), or
2. Work around in Comet: in the schema adapter or parquet reader factory, rewrite the physical type to INT32 before passing it to parquet-rs' validator, or special-case the Unknown-annotated field at read time.
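A rough sketch of what option 2 might look like, for discussion only. The types below (`PhysicalType`, `LogicalType`, `PrimitiveField`) and the functions `validate` and `normalize_unknown` are hypothetical stand-ins, not the parquet-rs or Comet APIs. `validate` only models the rule described above (`Unknown` accepted on INT32 only), and the sketch covers schema validation alone; it does not address how the column data is decoded afterwards.

```rust
// Hypothetical stand-ins for illustration; these are NOT the parquet-rs types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PhysicalType {
    Boolean,
    Int32,
}

impl PhysicalType {
    fn as_str(self) -> &'static str {
        match self {
            PhysicalType::Boolean => "BOOLEAN",
            PhysicalType::Int32 => "INT32",
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogicalType {
    Unknown,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct PrimitiveField {
    name: String,
    physical: PhysicalType,
    logical: Option<LogicalType>,
}

/// Models the rule described in the issue: an `Unknown` annotation is
/// accepted only on INT32; any other physical type is rejected.
fn validate(field: &PrimitiveField) -> Result<(), String> {
    match (field.logical, field.physical) {
        (Some(LogicalType::Unknown), PhysicalType::Int32) | (None, _) => Ok(()),
        (Some(LogicalType::Unknown), other) => Err(format!(
            "Cannot annotate Unknown from {} for field '{}'",
            other.as_str(),
            field.name
        )),
    }
}

/// Assumed workaround: since the physical type of an Unknown-annotated
/// field carries no meaning, rewrite it to INT32 before validation.
fn normalize_unknown(field: &PrimitiveField) -> PrimitiveField {
    let mut out = field.clone();
    if out.logical == Some(LogicalType::Unknown) {
        out.physical = PhysicalType::Int32;
    }
    out
}

fn main() {
    // Shape of the field Spark writes for a NullType column.
    let spark_field = PrimitiveField {
        name: "_1".to_string(),
        physical: PhysicalType::Boolean,
        logical: Some(LogicalType::Unknown),
    };

    // Rejected as written, accepted after normalization.
    assert!(validate(&spark_field).is_err());
    assert!(validate(&normalize_unknown(&spark_field)).is_ok());
    println!("{:?}", normalize_unknown(&spark_field));
}
```

Fields without an `Unknown` annotation pass through `normalize_unknown` unchanged, so the rewrite would be limited to the don't-care case Spark's comment describes.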
