Fokko commented on PR #1427: URL: https://github.com/apache/iceberg-python/pull/1427#issuecomment-2559871997
Just to add some context:

> Currently, PyIceberg's read path assumes to be run on a single node machine.

This assumption is embedded in the way we plan and execute the read path. For example, we use multi-threading and not (yet) multi-processing. We explored multi-processing, but it didn't give any advantages in terms of performance and introduced a lot of issues around pickling between the processes. Therefore we rely on multi-threading, and most of the Arrow stuff happens with the GIL released anyway (diminishing the upside of multi-processing).

```python
tasks = DistributedTableScan(table).plan_files()
```

I don't think this is the right path forward as it creates more confusion. It is still part of the public API, but it introduces another way of doing the same thing. Also, would this return the same set of files?

I agree with what @corleyma said, and I think we can add this API, just not make it public. I think this might create a lot of confusion for the user. For example, PyIceberg itself uses `plan_files`, Daft uses `plan_files` with their logic to combine certain tasks/files, and Ray would use `plan_tasks`, where it respects the newly added configuration. Why not add this logic to combine the tasks directly with Ray?

> Another option would be to provide a plan_util to support plan tasks like the Java-side implementation.

I like that idea. Could you elaborate on that? Are you suggesting something like:

```python
def convert_files_to_tasks(
    files: Iterable[FileScanTask],
    target_split_size: int,
    split_file_open_cost: int,
    loop_back: int,
) -> List[CombinedFileScanTask]:
    ...
```

I think I like that idea.
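For illustration only, here is a minimal sketch of what a helper with that signature might do: greedy bin-packing of file tasks by size, with a bounded number of open bins. It is an assumption about the idea being discussed, not code from the PR or from PyIceberg. `FileTask` and `CombinedTask` are hypothetical stand-ins for `FileScanTask` and `CombinedFileScanTask`, and the sketch only combines whole files; it does not split files larger than the target size.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(frozen=True)
class FileTask:
    """Hypothetical stand-in for FileScanTask: a path and a size in bytes."""
    path: str
    length: int


@dataclass
class CombinedTask:
    """Hypothetical stand-in for CombinedFileScanTask: files read together."""
    tasks: List[FileTask] = field(default_factory=list)
    weight: int = 0


def convert_files_to_tasks(
    files: Iterable[FileTask],
    target_split_size: int,
    split_file_open_cost: int,
    loop_back: int,
) -> List[CombinedTask]:
    """Pack files into groups of at most roughly target_split_size bytes.

    At most `loop_back` groups stay open at once; when that limit is
    exceeded, the oldest open group is emitted as finished.
    """
    if loop_back < 1:
        raise ValueError("loop_back must be at least 1")

    open_bins: List[CombinedTask] = []
    finished: List[CombinedTask] = []

    for task in files:
        # A tiny file still costs at least one file open.
        weight = max(task.length, split_file_open_cost)

        # Use the first open bin that still has room for this file.
        target = next(
            (b for b in open_bins if b.weight + weight <= target_split_size),
            None,
        )
        if target is None:
            # No bin fits: open a new one, and retire the oldest if needed.
            target = CombinedTask()
            open_bins.append(target)
            if len(open_bins) > loop_back:
                finished.append(open_bins.pop(0))

        target.tasks.append(task)
        target.weight += weight

    # Whatever is still open at the end is emitted in order.
    finished.extend(open_bins)
    return finished


# Usage: four files, 100-byte target, 10-byte open cost, two open bins.
example_files = [
    FileTask("a", 60), FileTask("b", 50), FileTask("c", 40), FileTask("d", 5),
]
example_groups = convert_files_to_tasks(example_files, 100, 10, 2)
# Groups by path: [["a", "c"], ["b", "d"]]
```

Because the helper takes the output of `plan_files` as plain input, the grouping policy lives outside the scan itself, so each engine could call it (or not) without a second planning entry point on the public API.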