jpugliesi commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2557778425
Just to contribute some findings: we also encountered this case, where `pyiceberg`'s scan planning (`plan_files`) was surprisingly slow when reading manifest files from GCS. Switching the `py-io-impl` from `pyiceberg.io.pyarrow.PyArrowFileIO` to `pyiceberg.io.fsspec.FsspecFileIO` improved performance significantly.

Attached are some screenshots of traces (run on my laptop) showing the performance difference we've consistently observed between the two `py-io-impl`s:

`pyiceberg.io.pyarrow.PyArrowFileIO`:
<img width="1709" alt="image" src="https://github.com/user-attachments/assets/a7b094fa-4dbd-4b6c-95c8-b6644e099c1d" />

`pyiceberg.io.fsspec.FsspecFileIO`:
<img width="1705" alt="image" src="https://github.com/user-attachments/assets/a4df475d-f4dc-4c84-b55e-2b7932ead6c2" />

With `PyArrowFileIO`, it looks like there is some resource contention. We tried tuning various things, such as [`ARROW_IO_THREADS`](https://arrow.apache.org/docs/cpp/threading.html#cpu-vs-i-o), but ultimately never identified the root cause.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
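
For anyone who wants to try the same `py-io-impl` switch described in the comment, here is a minimal sketch. The helper names (`with_file_io`, `load_catalog_with_fsspec`) and any catalog name or URI you pass in are hypothetical and only for illustration; the sketch also assumes the fsspec GCS backend (`gcsfs`) is installed alongside `pyiceberg`. It shows one way to set the property, not a recommended fix for the underlying slowdown.

```python
# Fully qualified FileIO class names mentioned in the comment.
FSSPEC_IO = "pyiceberg.io.fsspec.FsspecFileIO"
PYARROW_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"


def with_file_io(properties: dict, io_impl: str = FSSPEC_IO) -> dict:
    """Return a copy of the catalog properties with `py-io-impl` set.

    Hypothetical helper: it only builds the properties dict and does not
    touch pyiceberg, so the input dict is left unmodified.
    """
    updated = dict(properties)
    updated["py-io-impl"] = io_impl
    return updated


def load_catalog_with_fsspec(name: str, **properties):
    """Load a catalog whose FileIO is FsspecFileIO (hypothetical wrapper)."""
    # Imported lazily so `with_file_io` can be used without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **with_file_io(properties))
```

Swapping `FSSPEC_IO` for `PYARROW_IO` in `with_file_io` makes it easy to compare the two implementations on the same table, which is how the traces above could be reproduced.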