kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967707048
As a way to benchmark multithreaded writes to multiple Parquet files, I've noticed that DuckDB's `COPY` command has `per_thread_output` and `file_size_bytes` options. Using the `duckdb` CLI:

```
.timer on
CREATE TABLE tbl AS SELECT * FROM read_parquet('table_10000000.parquet');
SELECT COUNT(*) FROM tbl;
COPY (SELECT * FROM tbl) TO 'parquet_files' (FORMAT PARQUET, PER_THREAD_OUTPUT true, FILE_SIZE_BYTES '512MB', OVERWRITE_OR_IGNORE true);
```

Result:

```
Run Time (s): real 14.588 user 66.250569 sys 8.554207
```

And setting `FILE_SIZE_BYTES` to 256MB:

```
COPY (SELECT * FROM tbl) TO 'parquet_files' (FORMAT PARQUET, PER_THREAD_OUTPUT true, FILE_SIZE_BYTES '256MB', OVERWRITE_OR_IGNORE true);
```

```
Run Time (s): real 15.575 user 66.483984 sys 10.547852
```

I'm not sure if there's a way to specify the number of threads DuckDB can use, but watching `htop` while these statements run, I can see that all the cores are used.
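
For what it's worth, DuckDB does expose a `threads` setting that should cap the size of its worker pool, which would make the benchmark's parallelism explicit rather than defaulting to all cores. A minimal sketch, assuming the current CLI syntax (the value of 4 is only an illustration):

```
-- Cap DuckDB's worker pool (4 is just an example value)
SET threads = 4;

-- Confirm the active setting
SELECT current_setting('threads');

-- Re-run the same COPY to compare against the all-cores run
COPY (SELECT * FROM tbl) TO 'parquet_files' (FORMAT PARQUET, PER_THREAD_OUTPUT true, FILE_SIZE_BYTES '512MB', OVERWRITE_OR_IGNORE true);
```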