Fokko commented on PR #1720: URL: https://github.com/apache/iceberg-python/pull/1720#issuecomment-2696956695
@sharkdtu Thanks for the added context. Still, I don't think this is the right place to add it. Would each of the Ray workers call `_dataframe_to_data_files`? In the worst case, this could produce `partitions * workers` data files. Instead, the idea behind the notion of Tasks is that they can be fed into a distributed system.

The current `_dataframe_to_data_files` does both the generation of Tasks and the writing of the Parquet files. How about splitting it into `_dataframe_to_write_tasks` and `_write_tasks_to_parquet`, where Ray would implement a distributed variant of the latter? Thoughts?