DieHertz commented on PR #614:
URL: https://github.com/apache/iceberg-python/pull/614#issuecomment-2375186118

   Hi guys, sorry if it's not the right place to ask this question.
   Do you know of a viable way to speed up `table.inspect.files()` for large 
tables?
   Perhaps you have something in mind that I could implement and contribute upstream.
   
   I haven't profiled yet, but I suspect the gist of the issue is 
`manifest.fetch_manifest_entry` being called synchronously and sequentially in 
a loop. Offloading this to a thread-based executor doesn't help much, 
probably because of the GIL, and a process-based executor is harder to implement 
because of the unpicklable types involved.
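   For context, the thread-based approach I tried looks roughly like the sketch below. It is illustrative only: `fetch_manifest_entries` is a hypothetical stand-in for the real `manifest.fetch_manifest_entry` call (which in PyIceberg reads and parses an Avro manifest file, an I/O-bound operation), and the function and parameter names are my own, not PyIceberg API.

   ```python
   from concurrent.futures import ThreadPoolExecutor

   # Hypothetical stand-in for the real manifest-entry fetch; the actual
   # PyIceberg call reads and deserializes an Avro manifest file.
   def fetch_manifest_entries(manifest_path: str) -> list:
       return [f"{manifest_path}/entry-{i}" for i in range(3)]

   def collect_entries(manifest_paths: list, max_workers: int = 8) -> list:
       # Fan out one task per manifest file; executor.map preserves input order.
       with ThreadPoolExecutor(max_workers=max_workers) as pool:
           per_manifest = pool.map(fetch_manifest_entries, manifest_paths)
       # Flatten the per-manifest lists into a single list of entries.
       return [entry for entries in per_manifest for entry in entries]
   ```

   Since the fetch is I/O-bound, threads should in principle overlap the network/disk reads even under the GIL; my guess is the remaining cost is in the CPU-bound Avro deserialization, which threads cannot parallelize.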
   
   As of now, PySpark can collect the `.files` metadata table considerably 
faster than PyIceberg.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
