Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

via GitHub Tue, 22 Oct 2024 13:37:31 -0700


corleyma commented on issue #1229:
URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2430215685


   >IMO it makes sense to wait until it gets implemented or contribute there, 
rather than writing our own general-purpose Avro-To-Arrow code.
   
   We don't need general-purpose Avro-to-Arrow code, though; we only need a 
fast path from Iceberg manifests to Arrow.  Since we already have a layer 
that's decoding Avro quickly and then building Python dictionaries, we could 
bypass the dictionaries and build a pyarrow table directly.
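
A rough sketch of the idea, not pyiceberg's actual decoder: assume the Avro layer can hand back each manifest entry as a plain tuple of field values in a known order. The field names, the sample rows, and the `rows_to_columns` helper below are all hypothetical and exist only for illustration. The values are appended straight into per-column lists, so no per-row dictionary is ever created, and the column lists are then handed to `pa.table` if pyarrow is installed.

```python
from typing import Any, Dict, Iterable, List, Sequence

def rows_to_columns(
    field_names: Sequence[str], rows: Iterable[Sequence[Any]]
) -> Dict[str, List[Any]]:
    """Accumulate decoded rows column by column, skipping per-row dicts."""
    columns: Dict[str, List[Any]] = {name: [] for name in field_names}
    # Bind the append methods once so the inner loop does no dict lookups.
    appenders = [columns[name].append for name in field_names]
    for row in rows:
        for append, value in zip(appenders, row):
            append(value)
    return columns

# Hypothetical field names and decoded entries, standing in for whatever
# the Avro decoding layer would actually produce.
FIELDS = ("file_path", "record_count", "file_size_in_bytes")
decoded = [
    ("s3://bucket/a.parquet", 100, 2048),
    ("s3://bucket/b.parquet", 250, 4096),
]

columns = rows_to_columns(FIELDS, decoded)

try:
    import pyarrow as pa

    # pa.table accepts a mapping of column name to a list of values.
    table = pa.table(columns)
except ImportError:
    # pyarrow is optional for this sketch; the column lists are the point.
    table = None
```

A real fast path would more likely append into typed Arrow builders inside the decoder itself, but the shape is the same: one pass over the decoded values, one container per column.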
   
   The request for avro read support in Arrow is 7 years old at this point, so 
I don't think I'd wait on that.
   
   > In the meantime the approach with the ProcessPoolExecutor should give a 
significant improvement
   
   Using multiprocessing comes with a lot of potential headaches around 
cross-platform support and, especially when it happens transparently to 
callers, can create difficulties for other projects that want to use pyiceberg 
as an SDK.  If there's a way to improve performance without spawning 
subprocesses, it's worth exploring.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

