Re: [PR] Expose variantShreddingFunc() in Parquet.DataWriteBuilder [iceberg]

via GitHub Fri, 26 Sep 2025 03:31:16 -0700


pvary commented on PR #14153:
URL: https://github.com/apache/iceberg/pull/14153#issuecomment-3338003415


   > > How will this method be used?
   > 
   > This method is intended for use in 
`HiveFileWriterFactory#configureDataWrite(Parquet.DataWriteBuilder builder)`. 
The `HiveFileWriterFactory` itself extends 
`org.apache.iceberg.data.BaseFileWriterFactory<Record>`.
   
   Nice — I had honestly forgotten how we implemented writes in Hive.
   You might want to check out the File Format API proposal, since it will 
impact Hive integration as well.
   
   > > What is the plan to provide the same functionality for Avro and ORC?
   > 
   > I’m not aware of any such plans. At the moment, variant shredding is only 
supported in Parquet.
   
   If we want this feature to be available across all supported file formats, 
we’ll eventually need a more generic interface.
   
   > > I have checked the PR, and I'm afraid that we need to revisit exposing 
this change. This is a File Format specific method
   > 
   > I don’t see how this change could interfere with anything. It’s just a 
simple setter for the variant shredding function in the Parquet FileAppender, 
and indeed, it’s a file format specific method.
   > 
   > How else can we take advantage of the variant shredding feature?
   
   If our goal is to support all V3 features across all file formats, we’ll 
need consistent functionality in every WriteBuilder / DataWriteBuilder. That’s 
exactly what the File Format API proposal aims to address.
   
   So far, the File Format API proposal has planned to expose an 
`inputSchema(S type)` method that lets engines pass in the input schema, which 
implicitly defines the shredding configuration. The `VariantShreddingFunction` 
offers an alternative, and we’ll be discussing this in Monday’s File Format API 
Sync. But even if we go with the function-based approach, we may need to revise 
the signature to make it more generic. It could return an engine type or an 
Iceberg type, but tying it to a format-specific type would defeat the purpose of 
the API: providing a file format-independent interface for users.
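
   To make the two options concrete, here is a rough sketch of what a 
format-independent builder could look like. Everything in it is hypothetical: 
`FormatIndependentWriteSketch`, `WriteBuilder<S, T>`, `shreddingFunction`, and 
`shreddedTypeFor` are illustrative names, not Iceberg or File Format API types, 
and plain strings stand in for the engine schema and the shredded type.

```java
import java.util.Optional;
import java.util.function.BiFunction;

// Hypothetical sketch only: these types are illustrative and are NOT Iceberg APIs.
final class FormatIndependentWriteSketch {

  // S is the engine's schema type; T is the shredded type, assumed here to be an
  // engine type or an Iceberg type rather than a file-format-specific one.
  static final class WriteBuilder<S, T> {
    private S inputSchema;
    private BiFunction<Integer, String, T> shreddingFunction;

    // Option 1: the engine passes its input schema, which implies the shredding.
    WriteBuilder<S, T> inputSchema(S schema) {
      this.inputSchema = schema;
      return this;
    }

    // Option 2: a function maps (field id, field name) to the shredded type.
    WriteBuilder<S, T> shreddingFunction(BiFunction<Integer, String, T> function) {
      this.shreddingFunction = function;
      return this;
    }

    // Returns the input schema if one was supplied.
    Optional<S> configuredInputSchema() {
      return Optional.ofNullable(inputSchema);
    }

    // Resolves the shredded type for a variant field; empty when no function is
    // configured or when the function declines to shred that field.
    Optional<T> shreddedTypeFor(int fieldId, String fieldName) {
      if (shreddingFunction == null) {
        return Optional.empty();
      }
      return Optional.ofNullable(shreddingFunction.apply(fieldId, fieldName));
    }
  }

  public static void main(String[] args) {
    // Strings stand in for real schema and type objects in this sketch.
    WriteBuilder<String, String> builder =
        new WriteBuilder<String, String>()
            .shreddingFunction(
                (id, name) -> "payload".equals(name) ? "struct<ts: long>" : null);

    System.out.println(builder.shreddedTypeFor(1, "payload"));
    System.out.println(builder.shreddedTypeFor(2, "other"));
  }
}
```

   The point of the type parameter `T` is that nothing in the builder refers 
to a Parquet type, so the same signature could in principle be implemented by 
an Avro or ORC writer.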
   
   Whichever direction we choose, I expect changes in how shredding is handled 
within writers. That’s why I mentioned that while we do need to expose this 
functionality, we might want to rethink how we expose it.



