Fokko commented on PR #1720:
URL: https://github.com/apache/iceberg-python/pull/1720#issuecomment-2696956695

   @sharkdtu Thanks for the added context. Still, I don't think this is the 
right place to add this.
   
   Would each of the Ray workers call `_dataframe_to_data_files`? In the worst 
case, this could produce `partitions * workers` data files. Instead, the idea 
behind Tasks is that they can be fed into a distributed system. The current 
`_dataframe_to_data_files` both generates the Tasks and writes the Parquet 
files. How about splitting it into `_dataframe_to_write_tasks` and 
`_write_tasks_to_parquet`, where Ray would implement a distributed variant of 
the latter? Thoughts?
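A rough sketch of the proposed split might look like the following. Note this uses simplified stand-in types (a minimal `WriteTask` dataclass, rows as dicts, file paths instead of real Parquet writes) rather than PyIceberg's actual `WriteTask` and writer APIs, purely to illustrate the separation of concerns:

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

# Simplified stand-in for PyIceberg's WriteTask; the real one carries
# schema, record batches, and partition metadata.
@dataclass(frozen=True)
class WriteTask:
    partition_key: str        # which partition this task writes
    rows: Tuple[dict, ...]    # the rows belonging to that partition


def _dataframe_to_write_tasks(rows: List[dict], partition_col: str) -> Iterator[WriteTask]:
    """Generate one WriteTask per partition, without writing any files."""
    by_partition: Dict[str, List[dict]] = {}
    for row in rows:
        by_partition.setdefault(row[partition_col], []).append(row)
    for key, part_rows in by_partition.items():
        yield WriteTask(partition_key=key, rows=tuple(part_rows))


def _write_tasks_to_parquet(tasks: Iterator[WriteTask]) -> List[str]:
    """Consume tasks and 'write' one data file per task (paths only, here).

    A distributed engine such as Ray could scatter the tasks across
    workers and run this step remotely, so each partition yields one
    file instead of partitions * workers files.
    """
    return [f"data/{t.partition_key}.parquet" for t in tasks]


rows = [
    {"day": "2024-01-01", "value": 1},
    {"day": "2024-01-01", "value": 2},
    {"day": "2024-01-02", "value": 3},
]
tasks = list(_dataframe_to_write_tasks(rows, partition_col="day"))
files = _write_tasks_to_parquet(iter(tasks))
print(len(tasks), files)  # 2 tasks, one file per partition
```

With this split, a Ray integration would only need to distribute the second step, e.g. wrapping `_write_tasks_to_parquet` in a remote function and submitting one task (or batch of tasks) per worker.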


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

