andygrove opened a new issue, #4199:
URL: https://github.com/apache/datafusion-comet/issues/4199

   ## Description
   
   Spark writes \`NullType\` columns to parquet with the \`BOOLEAN\` physical 
type and an \`Unknown\` logical type annotation. A comment in 
[\`ParquetSchemaConverter.scala\`](https://github.com/apache/spark/blob/v4.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L877)
notes: _"Selected primitive type here doesn't have significance"_. parquet-rs 
accepts \`LogicalType::Unknown\` only when it is paired with 
\`PhysicalType::INT32\`, and rejects any other physical type with 
\`Cannot annotate Unknown from BOOLEAN for field '…'\` 
([parquet-57.2.0/src/schema/types.rs:401, :423](https://docs.rs/parquet/57.2.0/parquet/schema/types/index.html)).
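
The pairing rule described above can be modelled in a few lines of Rust. This is a simplified, self-contained sketch and not parquet-rs code: the `Physical` and `Logical` enums and the `check_annotation` function are hypothetical stand-ins, only the `Unknown` case is modelled, and the error string simply follows the message quoted in this issue.

```rust
/// Hypothetical stand-in for a parquet physical type (small subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Physical {
    Boolean,
    Int32,
    Int64,
}

impl Physical {
    /// Upper-case name, matching the spelling in the quoted error message.
    pub fn name(self) -> &'static str {
        match self {
            Physical::Boolean => "BOOLEAN",
            Physical::Int32 => "INT32",
            Physical::Int64 => "INT64",
        }
    }
}

/// Hypothetical stand-in for a logical type annotation (only `Unknown`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Logical {
    Unknown,
}

/// Models the rule described in this issue: `Unknown` is accepted only
/// together with INT32; every other pairing is rejected.
pub fn check_annotation(
    logical: Logical,
    physical: Physical,
    field: &str,
) -> Result<(), String> {
    match (logical, physical) {
        (Logical::Unknown, Physical::Int32) => Ok(()),
        (l, p) => Err(format!(
            "Cannot annotate {:?} from {} for field '{}'",
            l,
            p.name(),
            field
        )),
    }
}
```

Under this model, the `(Unknown, BOOLEAN)` pairing that Spark writes for a field named `_1` is rejected with the same text as the error reported in this issue, while `(Unknown, INT32)` passes.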
   
   Result: any attempt to read a Spark-written parquet file that contains a 
\`NullType\` field fails in Comet with:
   
   \`\`\`
   org.apache.comet.CometNativeException: Parquet error: Cannot annotate Unknown from BOOLEAN for field '_1'
   \`\`\`
   
   The SPARK-54220 test in \`ParquetIOSuite\` (\`SPARK-54220: vectorized 
reader: missing all struct fields, struct with NullType only\`) is the concrete 
reproducer. It was un-ignored as part of PR #4190 / issue #4136, but it fails on 
the parquet read path before the new fix in 
\`parquet_convert_struct_to_struct\` is reached.
   
   ## Reproducer
   
   See \`issue #4136: struct with only NullType fields in file (SPARK-54220)\` 
in \`CometNativeReaderSuite\`. The failure manifests for both 
\`native_datafusion\` and \`native_iceberg_compat\`.
   
   ## Suspected fix
   
   Either:
   
   1. Change parquet-rs upstream to accept \`(Unknown, BOOLEAN)\` (and arguably 
any physical type, since Spark's comment makes clear that the physical type is a 
don't-care), or
   2. Work around it in Comet: in the schema adapter or parquet reader factory, 
rewrite the physical type to INT32 before passing it to the parquet-rs validator, 
or special-case the Unknown-annotated field at read time.
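
The rewrite in option 2 could look roughly like the sketch below. It is a self-contained model under stated assumptions, not Comet or parquet-rs code: `Leaf`, `Physical`, `normalize_unknown`, and `normalize_schema` are hypothetical names, and the sketch does not establish whether Comet can apply such a rewrite before parquet-rs validates the file schema.

```rust
/// Hypothetical stand-in for a parquet physical type (small subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Physical {
    Boolean,
    Int32,
}

/// Hypothetical, simplified view of a parquet leaf column
/// (not a parquet-rs type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Leaf {
    pub name: String,
    pub physical: Physical,
    /// True when the column carries the `Unknown` logical type annotation.
    pub unknown_annotated: bool,
}

/// Rewrites the physical type of an `Unknown`-annotated leaf to INT32,
/// the only pairing the validator described in this issue accepts.
/// Leaves without the annotation are returned unchanged.
pub fn normalize_unknown(leaf: Leaf) -> Leaf {
    if leaf.unknown_annotated {
        Leaf {
            physical: Physical::Int32,
            ..leaf
        }
    } else {
        leaf
    }
}

/// Applies the rewrite to every leaf of a (flattened) schema.
pub fn normalize_schema(leaves: Vec<Leaf>) -> Vec<Leaf> {
    leaves.into_iter().map(normalize_unknown).collect()
}
```

The sketch touches only `Unknown`-annotated leaves, so an ordinary `BOOLEAN` column keeps its physical type. It relies on the premise from Spark's comment that the physical type of such a column carries no significance.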


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

