Re: [I] improve performance of Table.add_files by parallelizing [iceberg-python]

via GitHub Wed, 20 Nov 2024 11:28:41 -0800


Fokko commented on issue #1335:
URL: 
https://github.com/apache/iceberg-python/issues/1335#issuecomment-2489383016


   Yes, this looks like it shouldn't be too hard. I think it would be good to 
[re-use the 
`ExecutorFactory`](https://github.com/apache/iceberg-python/blob/main/pyiceberg/utils/concurrent.py).
   
   I would refactor `parquet_files_to_data_files` so that it takes a single file 
instead of an `Iterator`, and rename it to `parquet_file_to_data_file`.
   
   ```python
   def _parquet_files_to_data_files(table_metadata: TableMetadata, file_paths: List[str], io: FileIO) -> Iterable[DataFile]:
       """Convert a list of files into DataFiles.
   
       Returns:
           An iterable that supplies DataFiles that describe the parquet files.
       """
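       # Proposed single-file variant of parquet_files_to_data_files.
       from pyiceberg.io.pyarrow import parquet_file_to_data_file
       from pyiceberg.utils.concurrent import ExecutorFactory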
   
       executor = ExecutorFactory.get_or_create()
       futures = [
           executor.submit(
               parquet_file_to_data_file,
               io,
               table_metadata,
               file_path
           )
           for file_path in file_paths
       ]
   
       # Call result() once per future and skip files that yield no DataFile.
       return [data_file for f in futures if (data_file := f.result())]
   ```
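
The submit-then-collect pattern above can be tried in isolation with the standard library. This is only a minimal sketch: `describe_file` and `describe_files` are hypothetical stand-ins for the per-file Parquet work, and a plain `ThreadPoolExecutor` is used on the assumption that the shared executor behaves like a standard `concurrent.futures` executor.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def describe_file(prefix: str, file_path: str) -> Optional[str]:
    """Hypothetical stand-in for the per-file work (e.g. reading a Parquet footer)."""
    if not file_path.endswith(".parquet"):
        return None  # nothing to report for this path
    return f"{prefix}:{file_path}"


def describe_files(prefix: str, file_paths: List[str]) -> List[str]:
    """Submit one task per file, then collect the results in submission order."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(describe_file, prefix, path) for path in file_paths]
        # result() blocks until the task finishes and re-raises any exception it hit.
        results = [future.result() for future in futures]
    # Drop the paths for which the worker returned nothing.
    return [result for result in results if result is not None]
```

Iterating over the futures in the order they were submitted keeps the output aligned with `file_paths`, even though the tasks themselves may finish in any order.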
   
   @kevinjqliu I would not classify this as a bugfix, so I'm not sure if this 
is appropriate for 0.8.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

