DieHertz commented on PR #614: URL: https://github.com/apache/iceberg-python/pull/614#issuecomment-2375186118
Hi guys, sorry if this isn't the right place to ask. Do you know of a viable way to speed up `table.inspect.files()` for large tables? Maybe you have something in mind that I could implement and contribute upstream.

I haven't profiled yet, but I suspect the gist of the issue is `manifest.fetch_manifest_entry` being called synchronously and sequentially in a loop. Offloading it to a thread-based executor doesn't help much, probably because of the GIL, and a process-based executor is harder to implement because of the unpicklable types involved. As of now, PySpark can collect the `.files` metadata table considerably faster than pyiceberg can.
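For reference, the fan-out I tried looks roughly like the sketch below. `fetch_entries` here is a stand-in for `manifest.fetch_manifest_entry` (the real call takes a `FileIO` and reads Avro), so this only illustrates the thread-pool shape, not the pyiceberg API; whether it helps in practice depends on how much of the Avro decoding releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_entries(manifest_path: str) -> list[str]:
    # Stand-in for manifest.fetch_manifest_entry: pretend each manifest
    # file yields two data-file entries.
    return [f"{manifest_path}/entry-{i}" for i in range(2)]

def collect_files(manifest_paths: list[str], max_workers: int = 8) -> list[str]:
    # Fan the per-manifest reads out across a thread pool instead of a
    # sequential loop; executor.map preserves the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_manifest = pool.map(fetch_entries, manifest_paths)
    return [entry for entries in per_manifest for entry in entries]

print(collect_files(["m1.avro", "m2.avro"]))
# → ['m1.avro/entry-0', 'm1.avro/entry-1', 'm2.avro/entry-0', 'm2.avro/entry-1']
```

With the real (I/O-bound) fetch, the pool at least overlaps the network round-trips even if the CPU-bound decoding stays serialized under the GIL, which matches what I observed: some speedup, but nowhere near linear.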