thijsheijden opened a new issue, #2132: URL: https://github.com/apache/iceberg-python/issues/2132
### Question

Hi! I am trying to add 1 million existing Parquet files to an Iceberg table using the `add_files` procedure. I am inserting them in 1000 batches of 1000 files. Every batch takes longer than the previous one, and at this point each batch takes around 1-2 minutes. This is much, much slower than Spark, which remains consistent throughout the entire insertion. How can I speed this up? The Parquet files already have metadata, so perhaps that could be exploited somehow?

Below is the code I am using:

```
import os
import argparse

import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# Command-line arguments: the directory holding the batch folders and the table name
parser = argparse.ArgumentParser()
parser.add_argument("file_dir")
parser.add_argument("table")
args = parser.parse_args()

warehouse_path = "/warehouse"
catalog = load_catalog(
    "pyiceberg",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
catalog.create_namespace_if_not_exists("default")

# Load the batches of files to import
batches = os.listdir(args.file_dir)
first_file = os.path.join(
    args.file_dir, "batch_0", os.listdir(os.path.join(args.file_dir, "batch_0"))[0]
)

# Create the table using the schema of the first file
df = pq.read_table(first_file)
table = catalog.create_table_if_not_exists(
    f"default.{args.table}",
    schema=df.schema,
)

batch_idx = 1
for batch_dir in batches:
    print(f"Adding batch {batch_idx}")
    batch_dir = os.path.join(args.file_dir, batch_dir)
    file_paths = [os.path.join(batch_dir, s) for s in os.listdir(batch_dir)]
    table.add_files(file_paths=file_paths)
    batch_idx += 1
```
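Side note on the metadata point: the initial table schema can be taken from the first file's Parquet footer instead of reading the whole file into memory. A minimal sketch using pyarrow's `pq.read_schema`, reusing the same `first_file`, `catalog`, and `args` as above:

```
import pyarrow.parquet as pq

# Read only the Parquet footer to get the Arrow schema,
# avoiding a full scan of the first file.
schema = pq.read_schema(first_file)

table = catalog.create_table_if_not_exists(
    f"default.{args.table}",
    schema=schema,  # pyiceberg accepts a pyarrow schema directly
)
```

This only speeds up the one-time table creation, not the per-batch slowdown itself.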