corleyma commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2430215685
> IMO it makes sense to wait until it gets implemented or contribute there, rather than writing our own general-purpose Avro-To-Arrow code.

We don't need general-purpose Avro-to-Arrow code, though; we only need a fast path from Iceberg manifests to Arrow. Since we already have a layer that decodes Avro quickly and then builds Python dictionaries, we could bypass the dictionaries and build a pyarrow table directly. The request for Avro read support in Arrow is 7 years old at this point, so I don't think I'd wait on that.

> In the meantime the approach with the ProcessPoolExecutor should give a significant improvement

Using multiprocessing comes with a lot of potential headaches around cross-platform support and, especially when it happens transparently to callers, can create difficulties for other projects that want to use pyiceberg as an SDK. If there's a way to improve performance without spawning subprocesses, it's worth exploring.
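To make the "bypass the dictionaries" idea concrete, here is a rough sketch of what I have in mind. It is not pyiceberg code: `rows_to_columns` and `to_arrow` are hypothetical helpers, and the tuples in `decoded` are made-up stand-ins for whatever the existing Avro decoder yields per manifest entry. The only real library call is `pyarrow.Table.from_pydict`, which is imported lazily so the column-building part runs without pyarrow installed.

```python
from typing import Any, Dict, Iterable, List, Sequence


def rows_to_columns(
    rows: Iterable[Sequence[Any]], field_names: Sequence[str]
) -> Dict[str, List[Any]]:
    """Accumulate decoded values column by column, with no per-row dict."""
    columns: Dict[str, List[Any]] = {name: [] for name in field_names}
    # Bind the append methods once so the inner loop does no dict lookups.
    appenders = [columns[name].append for name in field_names]
    for row in rows:
        for append, value in zip(appenders, row):
            append(value)
    return columns


def to_arrow(columns: Dict[str, List[Any]]):
    """Build a pyarrow Table from the column lists (requires pyarrow)."""
    import pyarrow as pa  # imported lazily; assumed to be installed by the caller

    return pa.Table.from_pydict(columns)


# Hypothetical decoded manifest entries: (file_path, record_count, file_size_in_bytes)
field_names = ("file_path", "record_count", "file_size_in_bytes")
decoded = [
    ("s3://bucket/data/a.parquet", 100, 2048),
    ("s3://bucket/data/b.parquet", 250, 4096),
]

columns = rows_to_columns(decoded, field_names)
# table = to_arrow(columns)  # uncomment where pyarrow is available
```

A real fast path would probably write into Arrow builders or preallocated buffers straight from the decoder rather than going through Python lists, but even this shape avoids allocating one dictionary per manifest entry and stays in a single process.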