nssalian commented on PR #14297:
URL: https://github.com/apache/iceberg/pull/14297#issuecomment-4112833959
Thanks for the feedback @pvary. @aihuaxu, @pvary, and I synced offline to
discuss how to move this forward. Adding a note here so that it's easy to
review. I've made the following changes:
1. Refactored per @pvary's suggestion to buffer above the writer. Added
`BufferedFileAppender` in iceberg-core that buffers the first N rows, infers
the shredded schema, then creates the real writer.
2. Moved `VariantShreddingAnalyzer` from Spark to the parquet module as an
abstract class for Spark/Flink reuse.
@Guosmilesmile you can eventually reuse this in your PR.
3. Added `Parquet.WriteBuilder.withFileSchema(MessageType)` to supply a
pre-computed Parquet schema at write time.
4. Removed `WriterLazyInitializable`,
`SparkParquetWriterWithVariantShredding`, and the 4-arg `WriterFunction`, since
that wasn't the preferred pattern.
5. Added more tests and an extra check for decimal precision.
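
For reviewers who want the shape of item 1 at a glance, here is a minimal, self-contained sketch of the buffer-then-create pattern. It is not the code in this PR: `BufferingAppender`, `RowWriter`, and `inferSchema` are hypothetical stand-ins, rows are plain maps rather than Iceberg or Spark rows, and the "schema" is just field name to Java type name. It assumes only the behavior described above: hold the first N rows, infer a schema from them, create the real writer, replay the buffer, then pass later rows straight through.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class BufferedAppenderSketch {

  /** Hypothetical stand-in for the real file writer. */
  interface RowWriter<T> {
    void add(T row);
  }

  /** Hypothetical appender: buffers the first bufferSize rows, then creates the real writer. */
  static class BufferingAppender<T> {
    private final int bufferSize;
    private final Function<List<T>, RowWriter<T>> writerFactory;
    private List<T> buffer = new ArrayList<>();
    private RowWriter<T> delegate; // stays null until the sample has been analyzed

    BufferingAppender(int bufferSize, Function<List<T>, RowWriter<T>> writerFactory) {
      this.bufferSize = bufferSize;
      this.writerFactory = writerFactory;
    }

    void add(T row) {
      if (delegate != null) {
        delegate.add(row); // writer already exists: pass the row straight through
        return;
      }
      buffer.add(row);
      if (buffer.size() >= bufferSize) {
        initialize();
      }
    }

    void close() {
      if (delegate == null) {
        initialize(); // fewer than bufferSize rows arrived: infer from what we have
      }
    }

    private void initialize() {
      // The factory sees the sampled rows, so it can infer a schema before building the writer.
      delegate = writerFactory.apply(Collections.unmodifiableList(buffer));
      for (T row : buffer) {
        delegate.add(row); // replay buffered rows in their original order
      }
      buffer = null;
    }
  }

  /** Toy inference: field name mapped to the Java type of its first non-null value. */
  static Map<String, String> inferSchema(List<Map<String, Object>> sample) {
    Map<String, String> schema = new TreeMap<>();
    for (Map<String, Object> row : sample) {
      for (Map.Entry<String, Object> field : row.entrySet()) {
        if (field.getValue() != null) {
          schema.putIfAbsent(field.getKey(), field.getValue().getClass().getSimpleName());
        }
      }
    }
    return schema;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> written = new ArrayList<>();
    BufferingAppender<Map<String, Object>> appender =
        new BufferingAppender<Map<String, Object>>(2, sample -> {
          System.out.println("inferred schema: " + inferSchema(sample));
          return written::add;
        });

    Map<String, Object> first = Map.of("id", 1, "name", "a");
    Map<String, Object> second = Map.of("id", 2, "name", "b");
    Map<String, Object> third = Map.of("id", 3, "name", "c");
    appender.add(first);  // buffered
    appender.add(second); // buffer is full: schema inferred, writer created, buffer replayed
    appender.add(third);  // written directly
    appender.close();
    System.out.println("rows written: " + written.size());
  }
}
```

The `close()` path matters for small files: if fewer than N rows ever arrive, the writer still has to be created from whatever was buffered.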
@huaxingao, @pvary, @aihuaxu, please review when you have a chance.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]