Fokko closed issue #1335: improve performance of Table.add_files by
parallelizing
URL: https://github.com/apache/iceberg-python/issues/1335
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specif
kevinjqliu commented on issue #1335:
URL:
https://github.com/apache/iceberg-python/issues/1335#issuecomment-2489493449
Makes sense, this is a feature
Fokko commented on issue #1335:
URL:
https://github.com/apache/iceberg-python/issues/1335#issuecomment-2489383016
Yes, looks like this shouldn't be too hard. I think it would be good to
[re-use the
`ExecutorFactory`](https://github.com/apache/iceberg-python/blob/main/pyiceberg/utils/concur
vtk9 commented on issue #1335:
URL:
https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487544116
Apologies @kevinjqliu, I forgot to link the relevant Slack thread
https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1731611943890879
Exactly, thank you @bigluck!
kevinjqliu commented on issue #1335:
URL:
https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487569620
sounds good! Feel free to ping me for review. I'll add this issue to the
0.8.1 milestone for now
kevinjqliu commented on issue #1335:
URL:
https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487549044
@vtk9 thanks for the context from Slack, I must have missed that thread
kevinjqliu commented on issue #1335:
URL:
https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487547946
thanks @bigluck that makes sense! I think `_parquet_files_to_data_files`
might be a good place to add the parallelism
@vtk9 is this something you would like to cont
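A rough sketch of the idea discussed in this thread: run the per-file work on a thread pool instead of one file at a time. This is not PyIceberg's implementation. `read_file_metadata` and `files_to_metadata_parallel` are hypothetical names, the sleep only simulates per-file I/O latency, and the standard-library `ThreadPoolExecutor` stands in for the `ExecutorFactory` mentioned above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List


def read_file_metadata(file_path: str) -> Dict[str, str]:
    """Hypothetical stand-in for the per-file work (e.g. reading a Parquet footer)."""
    time.sleep(0.01)  # simulate I/O latency; an assumption, not a measurement
    return {"file_path": file_path}


def files_to_metadata_parallel(
    file_paths: List[str], max_workers: int = 8
) -> Iterator[Dict[str, str]]:
    """Yield one result per file, in input order, doing the I/O on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map submits all tasks up front and yields results in input order.
        yield from executor.map(read_file_metadata, file_paths)


# Made-up paths, for illustration only.
paths = [f"s3://bucket/data/file-{i}.parquet" for i in range(20)]
results = list(files_to_metadata_parallel(paths))
```

The function stays a generator, so callers that iterate over the results should not need to change. Threads are a reasonable fit here only on the assumption that the per-file work is I/O-bound.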
bigluck commented on issue #1335:
URL:
https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487492436
I believe @vtk9 is suggesting that the files be read in parallel rather than
sequentially.
I could be mistaken, but it seems that if you have 10,000 files, each one is
kevinjqliu commented on issue #1335:
URL:
https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487476101
[`_parquet_files_to_data_files` is a
generator](https://github.com/apache/iceberg-python/blob/3ccdc44735d70bd3ef6ed18b60b3eba43c4b3b44/pyiceberg/table/__init__.py#L1529-L15