Re: [PR] Allow writing dataframes that are either a subset of table schema or in arbitrary order [iceberg-python]

via GitHub Mon, 08 Jul 2024 17:44:11 -0700


kevinjqliu commented on PR #829:
URL: https://github.com/apache/iceberg-python/pull/829#issuecomment-2215827324


   > First of all, sorry for the late reply. Feel free to ping me more 
aggressively.
   
   No worries at all; I forgot to ping you about this PR.
   
   > How about re-aligning the table before we write, otherwise we have to do 
all of this when reading. Most tables have far fewer writes than reads, so it 
is good to optimize for reads. 
   
   Can you say a bit more about "re-aligning"? Does it mean matching the 
Parquet schema to the Iceberg schema? 
   I see that `to_requested_schema` is currently used to coerce the data before 
it is written to Parquet. 
   
https://github.com/apache/iceberg-python/blame/7dff359e0515839fbe24fac2108dcb2d64694b7a/pyiceberg/io/pyarrow.py#L1915-L1918
   
   Is the idea to do this for the entire Arrow table before writing? If so, 
maybe we can push the `to_requested_schema` call up the stack and simplify 
`write_parquet`. I also mentioned this in 
https://github.com/apache/iceberg-python/pull/786#discussion_r1646417180
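
As a rough, library-free sketch of what aligning the data to the table schema before the write could look like: the helper `realign_columns` and the dict-of-lists stand-in for an Arrow table are made up for illustration and are not PyIceberg or PyArrow APIs, and this is not how `to_requested_schema` is implemented.

```python
from typing import Any

# Hypothetical stand-in for a table: column name -> list of values.
Columns = dict[str, list[Any]]


def realign_columns(columns: Columns, schema_field_names: list[str]) -> Columns:
    """Reorder columns to the table schema order, filling absent ones with nulls.

    Assumption for this sketch: the incoming data may be a subset of the
    schema and in arbitrary order, but must not contain unknown columns.
    """
    unknown = set(columns) - set(schema_field_names)
    if unknown:
        raise ValueError(f"Columns not in table schema: {sorted(unknown)}")

    # All columns are assumed to have the same length.
    num_rows = len(next(iter(columns.values()), []))

    # Dicts preserve insertion order, so the result follows the schema order.
    return {
        name: columns.get(name, [None] * num_rows)
        for name in schema_field_names
    }


# Data arrives as a subset of the schema and in a different order.
aligned = realign_columns({"b": [1, 2], "a": ["x", "y"]}, ["a", "b", "c"])
```

Doing this once, ahead of the write path, would mean the files on disk already follow the table schema, so readers would not need to reorder or back-fill columns.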
   
   @Fokko 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
