sharkdtu commented on PR #1720:
URL: https://github.com/apache/iceberg-python/pull/1720#issuecomment-2700025292

   > @sharkdtu Thanks for the added context. Still, I don't think this is the right place to add this.
   > 
   > Would each of the Ray workers call `_dataframe_to_data_files`? In the worst case, this might lead to `partitions * workers` number of data files. Instead, the idea behind the notion of Tasks is that they can be fed into a distributed system. The current `_dataframe_to_data_files` does both the generation of Tasks and writes the Parquet files. How about splitting this into `_dataframe_to_write_tasks` and `_write_tasks_to_parquet`, where Ray would implement a distributed variant of the latter. Thoughts?
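
   (For context, the split proposed above might look roughly like the sketch below. Only the two function names `_dataframe_to_write_tasks` and `_write_tasks_to_parquet` come from the quoted comment; the `WriteTask` fields, signatures, and bodies are assumptions for illustration, not PyIceberg's actual internals.)

```python
# Sketch only: the two top-level function names come from the review comment;
# this WriteTask shape and every signature here are assumptions for illustration.
from dataclasses import dataclass
from typing import Iterator, List

import pyarrow as pa
import pyarrow.parquet as pq


@dataclass
class WriteTask:
    task_id: int
    record_batches: List[pa.RecordBatch]


def _dataframe_to_write_tasks(df: pa.Table, max_rows_per_task: int = 1_000_000) -> Iterator[WriteTask]:
    # Planning only: chunk the dataframe into tasks without writing anything yet.
    for i, batch in enumerate(df.to_batches(max_chunksize=max_rows_per_task)):
        yield WriteTask(task_id=i, record_batches=[batch])


def _write_tasks_to_parquet(tasks: Iterator[WriteTask], location: str) -> List[str]:
    # Local, single-process variant; a Ray integration could ship each WriteTask
    # to a worker and run the same per-task body there.
    paths = []
    for task in tasks:
        path = f"{location}/task-{task.task_id}.parquet"
        pq.write_table(pa.Table.from_batches(task.record_batches), path)
        paths.append(path)
    return paths
```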
   
   @Fokko Thanks for the comments. I don't think `WriteTask` is the unit of work for a distributed system; it's just a writer for writing batches of records. The number of data files can be controlled by repartitioning before writing, as Spark does.
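
   For instance, a rough sketch of the repartition-before-write idea using Ray Data (the `write_shard` body and paths are stand-ins, not PyIceberg's actual write path; the repartition and map_batches calls are the point here):

```python
# Sketch only: bounds the number of output files by repartitioning up front,
# the way Spark users do before a write. write_shard is a stand-in writer.
import uuid

import pyarrow as pa
import pyarrow.parquet as pq
import ray


def write_shard(shard: pa.Table) -> pa.Table:
    # Stand-in for the per-shard write; in PyIceberg terms this is roughly
    # where a worker would turn its shard into Iceberg data files.
    pq.write_table(shard, f"/tmp/shard-{uuid.uuid4()}.parquet")
    return shard


ray.init(ignore_reinit_error=True)

table = pa.table({"id": list(range(100_000)), "v": [i * 0.5 for i in range(100_000)]})
ds = ray.data.from_arrow(table)

# Repartition first so the number of shards (and therefore data files) is bounded
# explicitly, instead of ending up with partitions * workers files.
ds = ds.repartition(num_blocks=16)

# batch_size=None keeps each block as one batch, so exactly 16 files are written.
ds.map_batches(write_shard, batch_format="pyarrow", batch_size=None).materialize()
```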



