beryan opened a new issue, #45969:
URL: https://github.com/apache/arrow/issues/45969

   ### Describe the enhancement requested
   
   As a user of the Arrow Dataset API, I would like to write partitioned data 
while preserving Parquet schema information. 
   
   For example, I may have an arrow::Table containing Parquet `INTERVAL` data 
stored in its physical representation, a `fixed_len_byte_array` of length 
12. Because no arrow::Schema type is a direct match, I use an 
`arrow::FixedSizeBinaryBuilder` to create the table. Neither the existing 
writer properties nor `arrow::dataset::FileSystemDataset::Write()` supports 
providing a native schema for the output file format. As a result, the 
Parquet logical types of written data that have no arrow::Schema equivalent 
are lost.
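   
   A minimal sketch of the scenario (the column name, values, and output 
path are illustrative; the Arrow and Dataset calls are the standard C++ APIs):
   
   ```cpp
   #include <arrow/api.h>
   #include <arrow/dataset/api.h>
   #include <arrow/filesystem/api.h>
   
   // Build a one-column table holding the 12-byte physical representation of
   // a Parquet INTERVAL (three little-endian uint32s: months, days, millis).
   arrow::Result<std::shared_ptr<arrow::Table>> MakeIntervalTable() {
     auto type = arrow::fixed_size_binary(12);
     arrow::FixedSizeBinaryBuilder builder(type);
     const uint8_t one_month_two_days_3ms[12] = {1, 0, 0, 0,
                                                 2, 0, 0, 0,
                                                 3, 0, 0, 0};
     ARROW_RETURN_NOT_OK(builder.Append(one_month_two_days_3ms));
     std::shared_ptr<arrow::Array> array;
     ARROW_RETURN_NOT_OK(builder.Finish(&array));
     return arrow::Table::Make(
         arrow::schema({arrow::field("interval_col", type)}), {array});
   }
   
   // Write it out with the dataset API. The column is stored as a plain
   // FIXED_LEN_BYTE_ARRAY(12); no field of FileSystemDatasetWriteOptions
   // accepts a parquet schema carrying the INTERVAL annotation.
   arrow::Status WriteIt(const std::shared_ptr<arrow::Table>& table) {
     auto dataset = std::make_shared<arrow::dataset::InMemoryDataset>(table);
     ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
     ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
   
     auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
     arrow::dataset::FileSystemDatasetWriteOptions write_options;
     write_options.file_write_options = format->DefaultWriteOptions();
     write_options.filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
     write_options.base_dir = "/tmp/interval_dataset";  // illustrative path
     write_options.partitioning = arrow::dataset::Partitioning::Default();
     write_options.basename_template = "part-{i}.parquet";
     return arrow::dataset::FileSystemDataset::Write(write_options, scanner);
   }
   ```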
   
   Some Parquet logical types affected:
   - interval
   - uuid
   - enum
   - json
   - bson
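   
   All of the types listed above already have factories in the parquet-cpp 
logical-type API, so the gap is purely in how a target schema reaches the 
dataset writer. For illustration:
   
   ```cpp
   #include <parquet/types.h>
   
   // Existing parquet-cpp factories for the logical types listed above;
   // none of these annotations survive a dataset write today because
   // there is no way to request them.
   auto interval = parquet::LogicalType::Interval();
   auto uuid     = parquet::LogicalType::UUID();
   auto enum_t   = parquet::LogicalType::Enum();
   auto json     = parquet::LogicalType::JSON();
   auto bson     = parquet::LogicalType::BSON();
   ```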
   
   **Current behavior:** When using the Arrow Dataset API, round-tripping of 
data types is limited to the types an arrow::Schema can represent.
   
   **Desired behavior:** Allow the user to supply a target Parquet schema 
when writing, so that logical types with no Arrow equivalent are preserved 
in the output files.
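   
   For reference, the existing parquet::schema API can already describe the 
desired target; how such a schema would be threaded through the dataset 
write options is the open design question. A sketch of the schema the user 
would like to hand to the writer (the function name is illustrative):
   
   ```cpp
   #include <parquet/schema.h>
   #include <parquet/types.h>
   
   // Describe interval_col as FIXED_LEN_BYTE_ARRAY(12) annotated with the
   // INTERVAL logical type, matching the table built earlier.
   std::shared_ptr<parquet::schema::GroupNode> MakeTargetSchema() {
     parquet::schema::NodeVector fields;
     fields.push_back(parquet::schema::PrimitiveNode::Make(
         "interval_col", parquet::Repetition::OPTIONAL,
         parquet::LogicalType::Interval(),
         parquet::Type::FIXED_LEN_BYTE_ARRAY, /*primitive_length=*/12));
     return std::static_pointer_cast<parquet::schema::GroupNode>(
         parquet::schema::GroupNode::Make(
             "schema", parquet::Repetition::REQUIRED, fields));
   }
   ```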
   
   ### Component(s)
   
   C++

