kevinjqliu commented on issue #1479: URL: https://github.com/apache/iceberg-python/issues/1479#issuecomment-2568436698
> there is no noticeable time difference between single-threaded and multi-threaded execution. The total time is directly proportional to the number of manifest entries.

Could you print out `ExecutorFactory.max_workers()` to double-check the value?

> For instance, consider a scenario with 6 manifest files, each containing 7,000 entries. With max-workers=32, the code spawns 6 threads, each completing after approximately 30 seconds concurrently. In contrast, with max-workers=1, the code processes the manifest files sequentially, yet still finishes in roughly 30 seconds.

There's already some discussion around this in #1229. The issue might be with I/O-bound tasks and the Python GIL. Can you give `ProcessPoolExecutor` a try?
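For anyone who wants to see the threads-versus-processes effect in isolation, here is a minimal standard-library sketch. It is not PyIceberg code: `decode_manifest_stub` and `timed_map` are hypothetical names, and the stub assumes the per-manifest work is CPU-bound pure Python, which is the case where the GIL would serialize threads. If the real work is dominated by network reads, the numbers will look different.

```python
import time
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import List, Tuple, Type


def decode_manifest_stub(num_entries: int) -> int:
    """Hypothetical stand-in for decoding one manifest: a pure-Python, CPU-bound loop."""
    checksum = 0
    for i in range(num_entries):
        checksum = (checksum + i * i) % 1_000_003
    return checksum


def timed_map(
    executor_cls: Type[Executor], max_workers: int, sizes: List[int]
) -> Tuple[List[int], float]:
    """Run the stub over every size with the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=max_workers) as pool:
        results = list(pool.map(decode_manifest_stub, sizes))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Six "manifests" of equal size, loosely mirroring the scenario quoted above.
    sizes = [350_000] * 6
    for cls, workers in [
        (ThreadPoolExecutor, 1),
        (ThreadPoolExecutor, 6),
        (ProcessPoolExecutor, 6),
    ]:
        _, elapsed = timed_map(cls, workers, sizes)
        print(f"{cls.__name__}(max_workers={workers}): {elapsed:.2f}s")
```

On a standard GIL-enabled CPython build, I would expect the two thread-pool runs to take about the same wall time and the process pool to be faster on a multi-core machine, but that is worth measuring rather than assuming. The `if __name__ == "__main__":` guard matters for `ProcessPoolExecutor` on platforms that start workers by spawning a fresh interpreter.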