sharkdtu commented on PR #1720: URL: https://github.com/apache/iceberg-python/pull/1720#issuecomment-2700025292
> @sharkdtu Thanks for the added context. Still, I don't think this is the right place to add this.
>
> Would each of the Ray workers call `_dataframe_to_data_files`? In the worst case, this might lead to `partitions * workers` number of data files. Instead, the idea behind the notion of Tasks is that they can be fed into a distributed system. The current `_dataframe_to_data_files` does both the generation of Tasks and writes the Parquet files. How about splitting this into `_dataframe_to_write_tasks` and `_write_tasks_to_parquet`, where Ray would implement a distributed variant of the latter. Thoughts?

@Fokko Thanks for the comments. I think `WriteTask` is not a task for a distributed system; it's just a writer for writing batches of records. The number of data files can be controlled by repartitioning before writing, as Spark does.
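For illustration only, a minimal sketch of that idea from the caller's side, assuming Ray Data is the engine: the `ray.data` calls below are real, but the bucket paths are hypothetical and the final write is a plain Parquet write standing in for the Iceberg write path discussed in this PR.

```python
import ray

# Read the source data as a Ray Dataset.
ds = ray.data.read_parquet("s3://bucket/source/")

# Bound the number of output data files up front by repartitioning,
# analogous to Spark's df.repartition(n) before a write.
ds = ds.repartition(16)

# Placeholder write step: each of the 16 blocks is handed to a writer
# and produces roughly one data file, so the file count stays bounded
# regardless of how many workers participate.
ds.write_parquet("s3://bucket/target/")
```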