kevinjqliu commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1967707048
As a way to benchmark multithreaded writes to multiple Parquet files, I've noticed that DuckDB's `COPY` command has `per_thread_output` and `file_size_bytes` options. Using the `duckdb` CLI:

```
.timer on
CREATE TABLE tbl AS SELECT * FROM read_parquet('table_10000000.parquet');
SELECT COUNT(*) FROM tbl;
COPY (SELECT * FROM tbl) TO 'parquet_files' (FORMAT PARQUET, PER_THREAD_OUTPUT true, FILE_SIZE_BYTES '512MB', OVERWRITE_OR_IGNORE true);
```

Result:

```
Run Time (s): real 14.588 user 66.250569 sys 8.554207
```

And setting `FILE_SIZE_BYTES` to 256MB:

```
COPY (SELECT * FROM tbl) TO 'parquet_files' (FORMAT PARQUET, PER_THREAD_OUTPUT true, FILE_SIZE_BYTES '256MB', OVERWRITE_OR_IGNORE true);
```

```
Run Time (s): real 15.575 user 66.483984 sys 10.547852
```

I'm not sure if there's a way to specify the number of threads DuckDB can use, but watching `htop` while these statements run, I can see that all the cores are used.
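
For what it's worth, DuckDB does expose a `threads` setting that should cap the size of its worker pool, which would make the benchmark's parallelism explicit rather than defaulting to all cores. A minimal sketch, assuming the current CLI syntax (the value of 4 is only an illustration):

```
-- Cap DuckDB's worker pool (4 is just an example value)
SET threads = 4;

-- Confirm the active setting
SELECT current_setting('threads');

-- Re-run the same COPY to compare against the all-cores run
COPY (SELECT * FROM tbl) TO 'parquet_files' (FORMAT PARQUET, PER_THREAD_OUTPUT true, FILE_SIZE_BYTES '512MB', OVERWRITE_OR_IGNORE true);
```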