Fokko commented on PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#issuecomment-2559871997
Just to add some context:
> Currently, PyIceberg's read path assumes to be run on a single node
machine. This assumption is embedded in the way we plan and execute the read
path. For example, we use multi-threading and not (yet) multi-processing.
We explored multi-processing, but it didn't give any advantages in terms of
performance and introduced a lot of issues around pickling between the
processes. Therefore we rely on multi-threading and most of the Arrow stuff
happens with the GIL released anyway (diminishing the upside of
multi-processing).
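To illustrate the threading point, here is a minimal sketch. It is not PyIceberg's actual read path: `zlib.decompress` from the standard library stands in for the Arrow decoding work because it also releases the GIL while it runs, and `read_task` and the payloads are hypothetical names made up for this example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for data files: four compressed 1 MiB buffers.
payloads = [zlib.compress(bytes([i]) * (1 << 20)) for i in range(4)]


def read_task(payload: bytes) -> int:
    # zlib.decompress releases the GIL while it works, so several of these
    # calls can make progress in parallel on separate threads.
    return len(zlib.decompress(payload))


# Threads share memory, so inputs and results are passed by reference
# rather than being pickled, as they would be with a process pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = list(pool.map(read_task, payloads))
```

Since the heavy lifting happens outside the GIL, a thread pool already gets parallelism here, and nothing has to be serialized between workers.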
```python
tasks = DistributedTableScan(table).plan_files()
```
I don't think this is the right path forward, as it creates more confusion.
It is still part of the public API, but it introduces another way of doing the
same thing. Also, would this return the same set of files? I agree with what
@corleyma said, and I think we can add this API, just not make it public.
Making it public might create a lot of confusion for users. For example,
PyIceberg itself uses `plan_files`, Daft uses `plan_files` with its own logic
to combine certain tasks/files, and Ray would use `plan_tasks`, which respects
the newly added configuration. Why not add the logic to combine the tasks
directly in Ray?
> Another option would be to provide a plan_util to support plan tasks like
the Java-side implementation.
I like that idea. Could you elaborate on that? Are you suggesting something
like:
```python
def convert_files_to_tasks(
    files: Iterable[FileScanTask],
    target_split_size: int,
    split_file_open_cost: int,
    loop_back: int,
) -> List[CombinedFileScanTask]:
    ...
```
I think I like that idea.
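For illustration, one possible shape of such a helper is sketched below. This is an assumption-laden sketch, not the Java implementation or an existing PyIceberg API: `FileTask` and `CombinedTask` are hypothetical stand-ins for `FileScanTask` and a `CombinedFileScanTask`, the packing is a simple first-fit over at most `loop_back` open bins, and large files are not split.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class FileTask:
    """Hypothetical stand-in for FileScanTask (path and size only)."""
    path: str
    length: int


@dataclass
class CombinedTask:
    """Hypothetical stand-in for a CombinedFileScanTask."""
    tasks: List[FileTask] = field(default_factory=list)
    weight: int = 0


def convert_files_to_tasks(
    files: Iterable[FileTask],
    target_split_size: int,
    split_file_open_cost: int,
    loop_back: int,
) -> List[CombinedTask]:
    open_bins: List[CombinedTask] = []
    done: List[CombinedTask] = []
    for f in files:
        # Small files are charged at least the cost of opening a file.
        weight = max(f.length, split_file_open_cost)
        # First fit: use the first open bin with enough room left.
        target = next(
            (b for b in open_bins if b.weight + weight <= target_split_size),
            None,
        )
        if target is None:
            target = CombinedTask()
            open_bins.append(target)
            # Keep at most `loop_back` bins open; emit the oldest one.
            if len(open_bins) > loop_back:
                done.append(open_bins.pop(0))
        target.tasks.append(f)
        target.weight += weight
    return done + open_bins


files = [
    FileTask("a.parquet", 60),
    FileTask("b.parquet", 50),
    FileTask("c.parquet", 30),
    FileTask("d.parquet", 5),
]
combined = convert_files_to_tasks(
    files, target_split_size=100, split_file_open_cost=10, loop_back=2
)
```

With these made-up sizes, `a`, `c` and `d` end up in one combined task and `b` in another. A caller such as Ray could then hand each combined task to a worker without PyIceberg exposing a second public planning API.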
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]