bigluck commented on issue #1335: URL: https://github.com/apache/iceberg-python/issues/1335#issuecomment-2487492436

I believe @vtk9 is suggesting that the files be read in parallel rather than sequentially. I could be mistaken, but it seems that if you have 10,000 files, each one is read one after the other. This can be quite time-consuming, even though I understand we are only reading the metadata of each parquet file. One option could be something like (pseudo-code alert):

```python
import concurrent.futures

def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
    futures = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit every file up front instead of scanning them one at a time.
        for file_path in file_paths:
            futures.append(executor.submit(scan_file, file_path))
        # Yield each DataFile as soon as its metadata read completes.
        for future in concurrent.futures.as_completed(futures):
            yield future.result()
```
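For what it's worth, here is a minimal, self-contained sketch of the same fan-out pattern, with a stand-in `scan_file` (hypothetical, just tagging the path) in place of the real per-file parquet metadata read, so the completion-order behavior can be seen in isolation:

```python
import concurrent.futures
from typing import Iterator, List


def scan_file(file_path: str) -> str:
    # Stand-in for the real per-file metadata read; in pyiceberg this would
    # open the parquet footer and build a DataFile. Here we just tag the path.
    return f"data-file:{file_path}"


def parquet_files_to_data_files(file_paths: List[str]) -> Iterator[str]:
    # Submit every file up front, then yield results as each read finishes,
    # so 10,000 files are scanned concurrently instead of one after another.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(scan_file, p) for p in file_paths]
        for future in concurrent.futures.as_completed(futures):
            yield future.result()


# as_completed yields in completion order, so sort for a deterministic view.
results = sorted(parquet_files_to_data_files([f"f{i}.parquet" for i in range(5)]))
print(results)
```

Note that `as_completed` returns futures in the order they finish, not the order they were submitted, so callers that need the original file order would have to re-sort or iterate the futures list directly.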
I believe @vtk9 is suggesting the files to be read in parallel rather than sequentially. I could be mistaken, but it seems that if you have 10,000 files, each one is being read one after the other. This approach can be quite time-consuming, even though I understand that we are only reading the metadata of each parquet file. One option could be to have something like (pseudo-code alert): ```python def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]: futures = [] with concurrent.futures.ThreadPoolExecutor() as executor: for file_path in file_paths: futures.append(executor.submit(scan_file, file_path)) for future in concurrent.futures.as_completed(futures): yield future.result() ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org