Re: [PR] Add plan tasks for TableScan [iceberg-python]

via GitHub Mon, 23 Dec 2024 07:08:29 -0800


Fokko commented on PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#issuecomment-2559871997


   Just to add some context:
   
   > Currently, PyIceberg's read path assumes to be run on a single node 
machine. This assumption is embedded in the way we plan and execute the read 
path. For example, we use multi-threading and not (yet) multi-processing.
   
   We explored multi-processing, but it gave no performance advantage and 
introduced a lot of issues around pickling between processes. We therefore rely 
on multi-threading; most of the Arrow work runs with the GIL released anyway, 
which diminishes the upside of multi-processing.
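
To illustrate the threading point, here is a minimal sketch using only the standard library. It is not PyIceberg code: `decode` and `read_all` are hypothetical names, and `zlib` only stands in for Arrow's native routines, on the assumption that both do their heavy work in native code that releases the GIL, so plain threads can overlap it without pickling anything between processes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def decode(payload: bytes) -> int:
    # Stand-in for decoding one file: the decompression runs in native code,
    # so other threads can make progress while this one is busy.
    return len(zlib.decompress(payload))


def read_all(payloads: List[bytes], max_workers: int = 4) -> List[int]:
    # Threads share memory, so inputs and results never need to be pickled.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the payloads.
        return list(pool.map(decode, payloads))
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` would require every argument and result to be picklable, which is where the issues mentioned above tend to show up.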
   
   ```python
   tasks = DistributedTableScan(table).plan_files()
   ```
   
   I don't think this is the right path forward. It is still part of the public 
API, yet it introduces a second way of doing the same thing. Also, would it 
return the same set of files? I agree with what @corleyma said: we can add this 
API, just not make it public. Otherwise users would face three different 
behaviours: PyIceberg itself uses `plan_files`, Daft uses `plan_files` with its 
own logic to combine certain tasks/files, and Ray would use `plan_tasks`, which 
respects the newly added configuration. Why not add the logic to combine the 
tasks directly on the Ray side?
   
   > Another option would be to provide a plan_util to support plan tasks like 
the Java-side implementation.
   
   I like that idea. Could you elaborate on that? Are you suggesting something 
like:
   
   ```python
   def convert_files_to_tasks(
       files: Iterable[FileScanTask],
       target_split_size: int,
       split_file_open_cost: int,
       loop_back: int
   ) -> List[CombinedFileScanTask]:
       ...
   ```
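
To make the question concrete, a rough sketch of what such a helper could do follows. Everything in it is an assumption for illustration: `FileTask` and `CombinedTask` are hypothetical stand-ins (not PyIceberg classes), and the packing is a simple greedy first-fit that keeps at most `loop_back` bins open, not necessarily what the Java side or this PR implements.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(frozen=True)
class FileTask:
    """Hypothetical stand-in for a file scan task (path and size only)."""
    path: str
    length: int


@dataclass
class CombinedTask:
    """Hypothetical stand-in for a combined task: a bundle of file tasks."""
    files: List[FileTask] = field(default_factory=list)
    weight: int = 0


def convert_files_to_tasks(
    files: Iterable[FileTask],
    target_split_size: int,
    split_file_open_cost: int,
    loop_back: int,
) -> List[CombinedTask]:
    open_bins: List[CombinedTask] = []
    done: List[CombinedTask] = []
    for f in files:
        # Tiny files still cost something to open, so floor their weight.
        weight = max(f.length, split_file_open_cost)
        # First fit: take the oldest open bin that still has room.
        target = next(
            (b for b in open_bins if b.weight + weight <= target_split_size),
            None,
        )
        if target is None:
            target = CombinedTask()
            open_bins.append(target)
            if len(open_bins) > loop_back:
                # Too many bins open: finalize the oldest one.
                done.append(open_bins.pop(0))
        target.files.append(f)
        target.weight += weight
    return done + open_bins
```

A larger `loop_back` gives small files more chances to fill earlier bins, at the cost of emitting tasks less strictly in input order. A file larger than `target_split_size` simply ends up in a bin of its own here; splitting it is out of scope for this sketch.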
   
   I think I like that idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

