Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2025-01-20 Thread via GitHub
gitzwz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2601733467 > Here's what I did in my particular case to give the general idea: Thanks to [DieHertz](https://github.com/DieHertz), I tried the demo code with CPython 3.12 & Debian

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-12-20 Thread via GitHub
jpugliesi commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2557778425 Just to contribute some findings: We also encountered this case where `pyiceberg`'s scanning `plan_files` was surprisingly slow reading manifest files from GCS. Switching t

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-11-20 Thread via GitHub
11xor6 commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2489912105 I'm encountering this as well, specifically with methods that rely on `plan_files`. If there's anything I can do to help or move this forward please let me know. -- This is

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-22 Thread via GitHub
DieHertz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2430363061 > If the work is done in Cython avro decoder -> pyarrow recordbatches using PyArrow Cython API, then that also leaves room to release the GIL for meaningful threaded concurr

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-22 Thread via GitHub
corleyma commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2430215685 > IMO it makes sense to wait until it gets implemented or contribute there, rather than writing our own general-purpose Avro-To-Arrow code. We don't need general-purpos

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-22 Thread via GitHub
DieHertz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2429473805 Here I have extracted the code returning `list[dict]` of entries for each `Manifest` and run it inside the `ThreadPoolExecutor` provided by the `pyiceberg.utils.concurrent.E

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-22 Thread via GitHub
DieHertz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2428572896 By the way, there is a related goal in Arrow: https://github.com/apache/arrow/issues/16991 IMO it makes sense to wait until it gets implemented or contribute there, r

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-22 Thread via GitHub
DieHertz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2428470415 > Had to solve it with cloudpickle. With `functools.partial` it doesn't seem to be necessary, as I've shown in my earlier messages. > yep, optimistically build
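The point about `functools.partial` versus cloudpickle can be illustrated with the standard library alone. This is a minimal sketch, not code from the thread: `parse_binary` is a hypothetical name, and the built-in `int` stands in for whatever per-manifest function was being sent to worker processes. It shows that a `partial` of an importable callable round-trips through `pickle`, while a lambda does not.

```python
import pickle
from functools import partial

# A partial wrapping an importable callable can be pickled with the standard
# pickle module, so it can be shipped to worker processes as-is.
parse_binary = partial(int, base=2)  # hypothetical example callable
restored = pickle.loads(pickle.dumps(parse_binary))
value = restored("1010")

# A lambda has no importable name, so the standard pickle module rejects it.
# This is the kind of case where a library such as cloudpickle is needed.
try:
    pickle.dumps(lambda text: int(text, base=2))
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```
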

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-22 Thread via GitHub
DieHertz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2428451067 So I haven't tried any actual changes yet, but decided to collect some baseline measurements with py-spy. First there's pyiceberg 0.7.1 `.inspect.files()` on my big ph

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-18 Thread via GitHub
kevinjqliu commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2423385861 > pyiceberg implements its own avro reader/writer using Cython yep, optimistically build avro decoder, fall back to pure python. See https://github.com/apache/ice

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-18 Thread via GitHub
corleyma commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2422931184 > Most of the time is spent processing the manifests record-by-record and converting each record to a dict I haven't looked at this closely, but if memory serves, pyic

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-15 Thread via GitHub
kevinjqliu commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2414941022 BTW I'm not opposed to using `ProcessPoolExecutor`. I'm just curious why `ThreadPoolExecutor` can't hit the same performance profile

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-15 Thread via GitHub
kevinjqliu commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2414939042 > Most of the time is spent processing the manifests record-by-record and converting each record to a dict Here's a snippet using threads to parallelize both reading

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-15 Thread via GitHub
DieHertz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2414915918 Indeed it is good enough for I/O-bound tasks, but in my understanding this part is CPU-bound. I think so because I'm observing close to 100% CPU usage when inside `pl

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-15 Thread via GitHub
kevinjqliu commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2414816052 That's interesting. I thought the `ThreadPoolExecutor` is good for I/O bound tasks such as reading from the avro manifest files. If you have a PoC, it's something I'd wa

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-15 Thread via GitHub
DieHertz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2414395991 > There's already an ExecutorFactory, do you think we can use that instead of ProcessPoolExecutor? The issue with the `ExecutorFactory` is it's using a `ThreadPool
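The executor question in this part of the thread can be sketched with the standard library only. This is a minimal illustration under stated assumptions, not pyiceberg code: `count_records` and `process_manifests` are hypothetical names, and splitting bytes on a delimiter stands in for the CPU-bound, record-by-record manifest decoding the commenters describe. The executor type is passed in, so the same call site can use threads or processes.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from functools import partial
from typing import Callable, Iterable


def count_records(manifest: bytes, delimiter: bytes) -> int:
    """Hypothetical stand-in for CPU-bound, record-by-record decoding."""
    return sum(1 for record in manifest.split(delimiter) if record)


def process_manifests(
    executor_factory: Callable[[], Executor],
    manifests: Iterable[bytes],
    delimiter: bytes = b"\n",
) -> list[int]:
    # A partial of a module-level function is picklable, so this task can be
    # submitted to a thread-based or a process-based executor unchanged.
    task = partial(count_records, delimiter=delimiter)
    with executor_factory() as executor:
        # Executor.map yields results in the same order as the inputs.
        return list(executor.map(task, manifests))


manifests = [b"a\nb\nc\n", b"d\n", b""]
counts = process_manifests(ThreadPoolExecutor, manifests)
```

Passing `ProcessPoolExecutor` instead of `ThreadPoolExecutor` would run the same task in separate processes, which sidesteps the GIL for CPU-bound work at the cost of pickling inputs and results. On platforms that spawn worker processes, that call should sit under an `if __name__ == "__main__":` guard.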

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-15 Thread via GitHub
kevinjqliu commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2414173534 As an aside, I think reading multiple manifests in parallel is something we'd want to reuse in other parts of the program

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-15 Thread via GitHub
DieHertz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2414054569 > Would you be interested in working on this issue? Yes, I'd be happy to contribute back

Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-15 Thread via GitHub
sungwy commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2413901958 Hi @DieHertz - thank you for raising this issue, and for sharing your benchmarks. I think this is a great idea, and one we should also consider applying to other `Inspect