Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

2025-03-03 Thread via GitHub
Fokko closed issue #1335: improve performance of Table.add_files by parallelizing URL: https://github.com/apache/iceberg-python/issues/1335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specif

Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

2024-11-20 Thread via GitHub
kevinjqliu commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2489493449 Makes sense, this is a feature

Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

2024-11-20 Thread via GitHub
Fokko commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2489383016 Yes, looks like this shouldn't be too hard. I think it would be good to [re-use the `ExecutorFactory`](https://github.com/apache/iceberg-python/blob/main/pyiceberg/utils/concur
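
The suggestion above is to re-use a shared executor rather than build a new thread pool for each call. A minimal stdlib-only sketch of that pattern follows. `SharedExecutor`, `file_size_hint`, and the `max_workers` default are hypothetical names chosen for illustration; this is not pyiceberg's `ExecutorFactory`, whose actual interface should be checked in the linked source.

```python
import threading
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Optional


class SharedExecutor:
    """Hypothetical stand-in for a process-wide executor factory."""

    _instance: Optional[Executor] = None
    _lock = threading.Lock()

    @classmethod
    def get_or_create(cls, max_workers: int = 4) -> Executor:
        # Create the pool lazily on first use, then hand back the same one.
        with cls._lock:
            if cls._instance is None:
                cls._instance = ThreadPoolExecutor(max_workers=max_workers)
            return cls._instance


def file_size_hint(path: str) -> int:
    # Placeholder for per-file work such as reading Parquet metadata.
    return len(path)


executor = SharedExecutor.get_or_create()
sizes = list(executor.map(file_size_hint, ["a.parquet", "bb.parquet"]))
same_pool = SharedExecutor.get_or_create() is executor
```

Sharing one pool keeps the total thread count bounded even when several operations run concurrently, which is presumably part of the appeal of re-using an existing factory.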

Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

2024-11-20 Thread via GitHub
vtk9 commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487544116 Apologies @kevinjqliu, I forgot to link the relevant Slack thread https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1731611943890879 Exactly, thank you @bigluck!

Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

2024-11-19 Thread via GitHub
kevinjqliu commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487569620 sounds good! Feel free to ping me for review. I'll add this issue to the 0.8.1 milestone for now

Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

2024-11-19 Thread via GitHub
kevinjqliu commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487549044 @vtk9 thanks for the context from slack, I must have missed that thread

Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

2024-11-19 Thread via GitHub
kevinjqliu commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487547946 thanks @bigluck that makes sense! I think `_parquet_files_to_data_files` might be a good place to add the parallelism @vtk9 is this something you would like to cont

Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

2024-11-19 Thread via GitHub
bigluck commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487492436 I believe @vtk9 is suggesting the files to be read in parallel rather than sequentially. I could be mistaken, but it seems that if you have 10,000 files, each one is
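
To make the sequential-versus-parallel distinction concrete, here is a small sketch using only the standard library. `read_footer` is a hypothetical stand-in for whatever per-file read happens for each Parquet file, and the paths and `max_workers` value are invented for the example; none of this is taken from the pyiceberg code base.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def read_footer(path: str) -> dict:
    """Hypothetical stand-in for reading one Parquet file's footer metadata."""
    return {"path": path, "record_count": len(path)}


def data_files_sequential(paths: List[str]) -> Iterator[dict]:
    # One file at a time: total latency is the sum of every read.
    for path in paths:
        yield read_footer(path)


def data_files_parallel(paths: List[str], max_workers: int = 8) -> Iterator[dict]:
    # Reads overlap across worker threads; Executor.map yields results
    # in the same order as the input paths.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        yield from executor.map(read_footer, paths)


paths = [f"s3://bucket/table/file-{i}.parquet" for i in range(10)]
sequential = list(data_files_sequential(paths))
parallel = list(data_files_parallel(paths))
```

Both functions remain generators and produce the same results in the same order, so a caller that iterates over the output would not need to change. Threads suit this kind of work because the per-file cost is mostly waiting on I/O.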

Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

2024-11-19 Thread via GitHub
kevinjqliu commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487476101 [`_parquet_files_to_data_files` is a generator](https://github.com/apache/iceberg-python/blob/3ccdc44735d70bd3ef6ed18b60b3eba43c4b3b44/pyiceberg/table/__init__.py#L1529-L15