kevinjqliu commented on PR #1427: URL: https://github.com/apache/iceberg-python/pull/1427#issuecomment-2557356375
Thanks everyone for the great discussion here! To summarize the thread above, I think the main concern here is around exposing this functionality as part of PyIceberg's `DataScan` public API. Currently, PyIceberg's read path assumes to be run on a single node machine. This assumption is embedded in the way we plan and execute the read path. For example, we use multi-threading and not (yet) multi-processing. As an Iceberg library for the python ecosystem, I do believe there's value for PyIceberg to provide the helper methods for distributed processing. I'd like to propose a path forward for this PR. Instead of integrating the feature directly into `DataScan`, what if we create a new class (or subclass `DataScan`) specifically for distributed processing? We can encapsulate the planning and execution logic inside this new class. The goal is to provide primitives to allow work to be distributed. I imagine something like this if I want to integrate with Ray. ``` table = catalog.load_table("blah") tasks = table.distributed_scan() futures = [process_task_remote.remote(task) for task in tasks] # Submit tasks to Ray for parallel processing results = ray.get(futures) ``` The `distributed_scan` is a helper function in the `Table` class. Alternatively, we can not expose this at all and have users call the new class directly. ``` tasks = DistributedTableScan(table).plan_files() ``` Looking forward to hear what people think! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org