bigluck commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487492436

I believe @vtk9 is suggesting that the files be read in parallel rather than sequentially. I could be mistaken, but it seems that if you have 10,000 files, each one is read one after the other. This can be quite time-consuming, even though I understand we are only reading the metadata of each parquet file. One option could be something like (pseudo-code alert):

```python
import concurrent.futures

def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
    futures = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit every file up front instead of scanning them one at a time.
        for file_path in file_paths:
            futures.append(executor.submit(scan_file, file_path))
        # Yield each DataFile as soon as its metadata read completes.
        for future in concurrent.futures.as_completed(futures):
            yield future.result()
```
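For what it's worth, here is a minimal, self-contained sketch of the same fan-out pattern, with a stand-in `scan_file` (hypothetical, just tagging the path) in place of the real per-file parquet metadata read, so the completion-order behavior can be seen in isolation:

```python
import concurrent.futures
from typing import Iterator, List


def scan_file(file_path: str) -> str:
    # Stand-in for the real per-file metadata read; in pyiceberg this would
    # open the parquet footer and build a DataFile. Here we just tag the path.
    return f"data-file:{file_path}"


def parquet_files_to_data_files(file_paths: List[str]) -> Iterator[str]:
    # Submit every file up front, then yield results as each read finishes,
    # so 10,000 files are scanned concurrently instead of one after another.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(scan_file, p) for p in file_paths]
        for future in concurrent.futures.as_completed(futures):
            yield future.result()


# as_completed yields in completion order, so sort for a deterministic view.
results = sorted(parquet_files_to_data_files([f"f{i}.parquet" for i in range(5)]))
print(results)
```

Note that `as_completed` returns futures in the order they finish, not the order they were submitted, so callers that need the original file order would have to re-sort or iterate the futures list directly.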
I believe @vtk9 is suggesting the files to be read in parallel rather than sequentially. I could be mistaken, but it seems that if you have 10,000 files, each one is being read one after the other. This approach can be quite time-consuming, even though I understand that we are only reading the metadata of each parquet file. One option could be to have something like (pseudo-code alert): ```python def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]: futures = [] with concurrent.futures.ThreadPoolExecutor() as executor: for file_path in file_paths: futures.append(executor.submit(scan_file, file_path)) for future in concurrent.futures.as_completed(futures): yield future.result() ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org