jpugliesi commented on issue #1229:
URL: 
https://github.com/apache/iceberg-python/issues/1229#issuecomment-2557778425

   Just to contribute some findings: we also encountered this case, where 
`pyiceberg`'s scan planning (`plan_files`) was surprisingly slow when reading 
manifest files from GCS. Switching the `py-io-impl` from 
`pyiceberg.io.pyarrow.PyArrowFileIO` to `pyiceberg.io.fsspec.FsspecFileIO` 
improved the performance significantly. Attached are some screenshots of traces 
(run on my laptop), showing the performance difference we've consistently 
observed between the two `py-io-impl`s:
   
   `pyiceberg.io.pyarrow.PyArrowFileIO`:
   <img width="1709" alt="image" 
src="https://github.com/user-attachments/assets/a7b094fa-4dbd-4b6c-95c8-b6644e099c1d"
 />
   
   
   `pyiceberg.io.fsspec.FsspecFileIO`:
   <img width="1705" alt="image" 
src="https://github.com/user-attachments/assets/a4df475d-f4dc-4c84-b55e-2b7932ead6c2"
 />
   
   With `PyArrowFileIO`, it looks like there is some resource contention. We 
tried tuning various things, such as 
[`ARROW_IO_THREADS`](https://arrow.apache.org/docs/cpp/threading.html#cpu-vs-i-o),
 but ultimately never identified the root cause.
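
   For anyone who wants to try the same comparison, here is a rough sketch of how the `py-io-impl` switch and a timing of `plan_files` might look. The catalog name, catalog type, URI, and the helper names (`catalog_properties`, `time_plan_files`) are placeholders I made up for illustration, not our actual setup; only the `py-io-impl` key and the two FileIO class paths come from the comment above.

```python
import time

# The two FileIO implementations compared above.
PYARROW_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"
FSSPEC_IO = "pyiceberg.io.fsspec.FsspecFileIO"


def catalog_properties(io_impl: str = FSSPEC_IO) -> dict[str, str]:
    """Build catalog properties that select a FileIO via `py-io-impl`.

    The `type` and `uri` values are hypothetical placeholders; replace them
    with the settings of your own catalog.
    """
    return {
        "type": "rest",                            # assumption: a REST catalog
        "uri": "https://example.invalid/catalog",  # assumption: placeholder URI
        "py-io-impl": io_impl,
    }


def time_plan_files(table_identifier: str, io_impl: str) -> tuple[float, int]:
    """Return (seconds spent in plan_files, number of planned file tasks)."""
    # Imported lazily so the helper above can be used without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("default", **catalog_properties(io_impl))
    table = catalog.load_table(table_identifier)

    start = time.perf_counter()
    tasks = list(table.scan().plan_files())  # forces the manifests to be read
    return time.perf_counter() - start, len(tasks)
```

   Calling `time_plan_files("my_namespace.my_table", PYARROW_IO)` and then the same with `FSSPEC_IO` (table name hypothetical) should give a rough side-by-side number. Run each a few times, since the first call may include connection setup.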


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
