kevinjqliu commented on PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#issuecomment-2557356375
Thanks everyone for the great discussion here! To summarize the thread
above, I think the main concern here is around exposing this functionality as
part of PyIceberg's `DataScan` public API.
Currently, PyIceberg's read path assumes it runs on a single-node machine.
This assumption is embedded in the way we plan and execute the read path. For
example, we use multi-threading and not (yet) multi-processing.
As an Iceberg library for the Python ecosystem, I do believe there's value
in PyIceberg providing helper methods for distributed processing.
I'd like to propose a path forward for this PR. Instead of integrating the
feature directly into `DataScan`, what if we create a new class (or subclass
`DataScan`) specifically for distributed processing? We can encapsulate the
planning and execution logic inside this new class. The goal is to provide
primitives to allow work to be distributed.
I imagine something like this if I want to integrate with Ray.
```
table = catalog.load_table("blah")
tasks = table.distributed_scan()
# Submit tasks to Ray for parallel processing
futures = [process_task_remote.remote(task) for task in tasks]
results = ray.get(futures)
```
Here, `distributed_scan` would be a helper method on the `Table` class.
Alternatively, we could leave `Table` untouched and have users call the new
class directly.
```
tasks = DistributedTableScan(table).plan_files()
```
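To make the shape of the proposal concrete, here is a minimal sketch using only
the standard library. Every name in it (`FileTask`, `DistributedTableScanSketch`,
`process_task`, and the file names and row counts) is hypothetical and not part
of PyIceberg's API. It assumes a task is plain data that can be serialized and
shipped to a worker, and it uses a thread pool as a local stand-in for a
scheduler such as Ray.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FileTask:
    """Hypothetical unit of work: one data file and its record count."""
    file_path: str
    record_count: int


class DistributedTableScanSketch:
    """Hypothetical stand-in for the proposed class; not PyIceberg's API."""

    def __init__(self, file_record_counts: Dict[str, int]) -> None:
        # Assumption: a real implementation would derive this from table
        # metadata (manifests) instead of taking it as an argument.
        self._file_record_counts = file_record_counts

    def plan_files(self) -> List[FileTask]:
        # Planning happens once, on the driver, and yields one task per file.
        return [FileTask(path, count) for path, count in self._file_record_counts.items()]


def process_task(task: FileTask) -> int:
    """Placeholder worker: returns the number of records it would have read."""
    return task.record_count


scan = DistributedTableScanSketch({"a.parquet": 250, "b.parquet": 100})
tasks = scan.plan_files()

# Tasks are plain data, so they round-trip through a wire format unchanged.
wire = [json.dumps(asdict(t)) for t in tasks]
received = [FileTask(**json.loads(w)) for w in wire]

# A thread pool stands in for a cluster scheduler here; map() keeps task order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_task, received))
```

With Ray, the `pool.map` call would be replaced by submitting each task to a
remote function and collecting the futures, as in the snippet above. The split
between planning and execution stays the same.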
Looking forward to hearing what people think!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]