Re: [PR] Add plan tasks for TableScan [iceberg-python]

via GitHub Fri, 20 Dec 2024 08:38:21 -0800


kevinjqliu commented on PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#issuecomment-2557356375


   Thanks everyone for the great discussion here! To summarize the thread 
above, I think the main concern here is around exposing this functionality as 
part of PyIceberg's `DataScan` public API. 
   
   Currently, PyIceberg's read path assumes it runs on a single-node machine. 
This assumption is embedded in the way we plan and execute reads. For 
example, we use multi-threading and not (yet) multi-processing.
   
   As an Iceberg library for the Python ecosystem, I do believe there's value 
in PyIceberg providing helper methods for distributed processing. 
   I'd like to propose a path forward for this PR. Instead of integrating the 
feature directly into `DataScan`, what if we create a new class (or subclass 
`DataScan`) specifically for distributed processing? We can encapsulate the 
planning and execution logic inside this new class.  The goal is to provide 
primitives to allow work to be distributed. 
   
   I imagine something like this if I want to integrate with Ray. 
   ```
   import ray

   table = catalog.load_table("blah")
   tasks = table.distributed_scan()
   # Submit tasks to Ray for parallel processing
   futures = [process_task_remote.remote(task) for task in tasks]
   results = ray.get(futures)
   ```
   
   Here, `distributed_scan` would be a helper method on the `Table` class. 
Alternatively, we could skip the helper entirely and have users call the new 
class directly. 
   ```
   tasks = DistributedTableScan(table).plan_files()
   ```
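
To make the split between planning and execution concrete, here is a rough, stdlib-only sketch of the shape such a class could take. Everything in it is an assumption for illustration: `FileTask`, `process_task`, `FAKE_STORAGE`, and the constructor signature are hypothetical and are not PyIceberg APIs, and the JSON round trip only stands in for shipping a task to another process or node (Ray, for instance, would handle that itself).

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Iterator, List

# Illustrative stand-in for data files in storage: path -> rows.
FAKE_STORAGE: Dict[str, List[int]] = {
    "data/a.parquet": [1, 2, 3],
    "data/b.parquet": [4, 5],
}


@dataclass(frozen=True)
class FileTask:
    """Hypothetical unit of work: plain data describing one file to read."""

    file_path: str


class DistributedTableScan:
    """Hypothetical planner (not a PyIceberg class): it plans, it never reads."""

    def __init__(self, file_paths: List[str]) -> None:
        self._file_paths = file_paths

    def plan_files(self) -> Iterator[FileTask]:
        # One task per data file; a real planner would also apply filters.
        for path in self._file_paths:
            yield FileTask(file_path=path)


def process_task(task: FileTask) -> List[int]:
    """Worker-side function: needs only the task, not the planner or table."""
    return FAKE_STORAGE[task.file_path]


# Driver side: plan once, then serialize each task as if sending it elsewhere.
tasks = DistributedTableScan(sorted(FAKE_STORAGE)).plan_files()
payloads = [json.dumps(asdict(task)) for task in tasks]

# Worker side: rebuild each task from its payload and process it independently.
results = [process_task(FileTask(**json.loads(p))) for p in payloads]
```

The point of the sketch is that each task is self-describing, serializable data, so the execution engine can be swapped without touching the planning logic.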
   
   Looking forward to hearing what people think! 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


