aihuaxu commented on PR #14297:
URL: https://github.com/apache/iceberg/pull/14297#issuecomment-3443998678

   Thanks @huaxingao and @pvary for reviewing, and thanks to Huaxin for 
explaining how the writer works in Spark.
   
   Regarding the concern about unstable schemas, Spark's approach makes sense 
(a rough sketch of similar heuristics follows the list):
   - If a field appears consistently and with a consistent type, create both 
`value` and `typed_value`
   - If a field appears with inconsistent types, create only `value`
   - Drop fields that occur in less than 10% of sampled rows
   - Cap the total at 300 fields (counting `value` and `typed_value` separately)
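
   To make this concrete, below is a minimal sketch of what such sampling heuristics could look like on the writer side. It is not Spark's actual implementation; the class, the field-path/type representation, and the thresholds are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Rough sketch of sampling-based shredding heuristics (illustration only;
 * names, type representation, and thresholds are assumptions, not Spark's
 * or Iceberg's actual implementation).
 */
class ShreddingHeuristics {
  private static final double MIN_FIELD_FREQUENCY = 0.10; // drop fields seen in <10% of sampled rows
  private static final int MAX_SHREDDED_FIELDS = 300;     // value and typed_value counted separately

  /** Per-field statistics gathered while sampling rows. */
  static class FieldStats {
    int occurrences = 0;
    String observedType = null;    // e.g. "long", "string"
    boolean consistentType = true;

    void update(String type) {
      occurrences++;
      if (observedType == null) {
        observedType = type;
      } else if (!observedType.equals(type)) {
        consistentType = false;
      }
    }
  }

  /**
   * Decides the shredded layout for each sampled field path:
   * consistent type -> value + typed_value, inconsistent type -> value only,
   * rare fields dropped, total column count capped.
   */
  static Map<String, String> decideLayout(Map<String, FieldStats> stats, int sampledRows) {
    Map<String, String> layout = new LinkedHashMap<>();
    int columns = 0;
    for (Map.Entry<String, FieldStats> entry : stats.entrySet()) {
      FieldStats fs = entry.getValue();
      if ((double) fs.occurrences / sampledRows < MIN_FIELD_FREQUENCY) {
        continue; // too rare to shred
      }
      int needed = fs.consistentType ? 2 : 1; // typed_value + value, or value only
      if (columns + needed > MAX_SHREDDED_FIELDS) {
        break; // column budget exhausted
      }
      layout.put(entry.getKey(),
          fs.consistentType ? "value + typed_value<" + fs.observedType + ">" : "value");
      columns += needed;
    }
    return layout;
  }
}
```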
   
   We could implement similar heuristics. Additionally, making the shredded 
schema configurable would allow users to choose which fields to shred at write 
time based on their read patterns.
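
   As one possible shape for that configuration, the sketch below sets a table property through Iceberg's existing `UpdateProperties` API; the property key `write.variant.shred-paths` and the path syntax are hypothetical, not an existing Iceberg option.

```java
import org.apache.iceberg.Table;

// Hypothetical write-time configuration: the property key
// "write.variant.shred-paths" and the path syntax are assumptions,
// not an existing Iceberg option. Only the UpdateProperties API itself is real.
class ConfigureVariantShredding {
  static void configure(Table table) {
    table.updateProperties()
        .set("write.variant.shred-paths", "payload.device.os,payload.user.id")
        .commit();
  }
}
```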
   
   For this POC, I'd appreciate feedback on whether there are significant 
high-level design alternatives to consider first, and whether this approach is 
acceptable. The current implementation feels a bit hacky; I may have missed the 
big picture of how the writers fit together across Spark + Iceberg + Parquet, 
and there may be a better way.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

