Fokko opened a new pull request, #6645: URL: https://github.com/apache/iceberg/pull/6645
Alternative for the multithreading part of: https://github.com/apache/iceberg/pull/6590

This uses the ThreadPool approach instead of the ThreadPoolExecutor. The ThreadPoolExecutor is more flexible and works well with heterogeneous tasks: it allows the user to handle exceptions per task and to cancel individual tasks. But the ThreadPoolExecutor also has some limitations, such as not being able to forcefully terminate all the tasks. For reading tasks I think the ThreadPool might be more appropriate, but for writing the ThreadPoolExecutor might be more applicable.

A very nice writeup of the differences is available in this blog: https://superfastpython.com/threadpool-vs-threadpoolexecutor/

Before:

```
➜ python git:(fd-threadpool) time python3 /tmp/test.py
python3 /tmp/test.py 3.45s user 2.84s system 2% cpu 3:34.19 total
```

After:

```
➜ python git:(fd-threadpool) ✗ time python3 /tmp/test.py
python3 /tmp/test.py 3.13s user 2.83s system 19% cpu 31.369 total
➜ python git:(fd-threadpool) ✗ time python3 /tmp/test.py
python3 /tmp/test.py 2.94s user 3.08s system 18% cpu 32.538 total
➜ python git:(fd-threadpool) ✗ time python3 /tmp/test.py
python3 /tmp/test.py 2.84s user 3.14s system 20% cpu 29.033 total
```

The requests travel a long distance, from the EU to the USA, which might impact the results a bit but makes IO more dominant.
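To illustrate the trade-off described in the pull request, here is a minimal sketch of the two standard-library APIs side by side. It is not the code from this PR: `read_task`, `run_with_thread_pool`, and `run_with_executor` are hypothetical names, and the sleep merely stands in for an IO-bound read.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from multiprocessing.pool import ThreadPool


def read_task(task_id: int) -> int:
    """Hypothetical stand-in for an IO-bound read (not PyIceberg code)."""
    time.sleep(0.01)  # simulate waiting on the network
    if task_id == 3:
        raise ValueError(f"task {task_id} failed")
    return task_id * 2


def run_with_thread_pool(task_ids):
    # ThreadPool: a plain map over homogeneous tasks. An exception raised in
    # a worker is re-raised by map(); leaving the `with` block calls
    # terminate() on the pool.
    with ThreadPool(processes=4) as pool:
        return pool.map(read_task, task_ids)


def run_with_executor(task_ids):
    # ThreadPoolExecutor: one Future per task, so each failure can be
    # handled individually instead of aborting the whole batch.
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(read_task, t): t for t in task_ids}
        for future in as_completed(futures):
            task_id = futures[future]
            try:
                results[task_id] = future.result()
            except ValueError as exc:
                errors[task_id] = str(exc)
    return results, errors
```

With the ThreadPool variant a single failing task surfaces as one exception for the whole `map` call, which may be acceptable for a read that should fail as a unit. The executor variant keeps the successful results and records the failure per task, at the cost of more bookkeeping.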
