thijsheijden opened a new issue, #2132:
URL: https://github.com/apache/iceberg-python/issues/2132

   ### Question
   
   Hi! I am trying to add 1 million existing Parquet files to an Iceberg table
   using the `add_files` procedure, inserting them as 1000 batches of 1000 files
   each. Every batch takes longer than the previous one, and at this point each
   batch takes around 1-2 minutes. This is much slower than Spark, whose
   per-batch time stays consistent throughout the entire insertion. How can I
   speed this up? The Parquet files already contain metadata, so perhaps that
   could be exploited somehow? Below is the code I am using:
   
   
   ```
   import os
   
   import pyarrow.parquet as pq
   from pyiceberg.catalog import load_catalog
   
   # NOTE: `args` is assumed to come from argparse and to provide
   # `file_dir` (directory containing the batch_* subdirectories) and `table`.
   
   warehouse_path = "/warehouse"
   catalog = load_catalog(
       "pyiceberg",
       **{
           "type": "sql",
           "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
           "warehouse": f"file://{warehouse_path}",
       },
   )
   catalog.create_namespace_if_not_exists("default")
   
   # Load the batches of files to import
   batches = os.listdir(args.file_dir)
   first_batch_dir = os.path.join(args.file_dir, "batch_0")
   first_file = os.path.join(first_batch_dir, os.listdir(first_batch_dir)[0])
   
   # Create table using schema of the first file
   df = pq.read_table(first_file)
   table = catalog.create_table_if_not_exists(
       f"default.{args.table}",
       schema=df.schema,
   )
   
   # Add each batch with a separate add_files call
   for batch_idx, batch_name in enumerate(batches, start=1):
       print(f"Adding batch {batch_idx}")
       batch_dir = os.path.join(args.file_dir, batch_name)
       file_paths = [os.path.join(batch_dir, name) for name in os.listdir(batch_dir)]
       table.add_files(file_paths=file_paths)
   ```
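
To put numbers on the slowdown described above, the loop could be wrapped in a small timing helper. This is only a sketch: `time_batches` and its `add_batch` and `clock` parameters are hypothetical names introduced here for illustration and are not part of PyIceberg. It records how long each call takes, so the growth per batch can be reported.

```python
import time
from typing import Callable, Iterable, List, Sequence, Tuple


def time_batches(
    batches: Iterable[Sequence[str]],
    add_batch: Callable[[Sequence[str]], None],
    clock: Callable[[], float] = time.perf_counter,
) -> List[Tuple[int, int, float]]:
    """Call add_batch once per batch and record (index, file count, seconds).

    add_batch is any callable that takes a list of file paths; the clock is
    injectable so the helper can be exercised without real I/O.
    """
    timings: List[Tuple[int, int, float]] = []
    for idx, paths in enumerate(batches, start=1):
        start = clock()
        add_batch(paths)
        elapsed = clock() - start
        timings.append((idx, len(paths), elapsed))
        print(f"batch {idx}: {len(paths)} files in {elapsed:.2f}s")
    return timings
```

With the snippet above, this would presumably be called as `time_batches(all_batches, lambda paths: table.add_files(file_paths=list(paths)))`, where `all_batches` is an assumed iterable of per-batch path lists.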


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
