bigluck commented on issue #428: URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1963740242
@kevinjqliu, your latest changes are mind-blowing (see https://github.com/apache/iceberg-python/issues/428#issuecomment-1962460623 for reference).

I tested your latest changes on `c5ad.8xlarge` and `c5ad.16xlarge` instances using my `10,000,000` table.

- On a `c5ad.8xlarge` instance with a `10Gbps NIC`, `32 cores`, and `64 GB RAM`, it took an average of `3.9s` to write `14` Parquet files. Previously, using `pynessie 0.6.0`, it took `31s`.
- On a `c5ad.16xlarge` instance with a `20Gbps NIC`, `64 cores`, and `128 GB RAM`, it took approximately `3.6s` to complete the same task, compared to `28.2s` using `pynessie 0.6.0`.

I experimented with different settings to improve write performance further, without success. Adjusting the `PYICEBERG_MAX_WORKERS` variable did not make much difference. This might be due to the small size of my dataset (only `6.69 GB` in Arrow format), which resulted in only 14 output files.

I also tested the `write.target-file-size-bytes` property, which produced 27 files when set to `268435456` and 54 files when set to `134217728`. However, even with `PYICEBERG_MAX_WORKERS` set to 64, the total write operation still took 3.6 seconds.

Overall, I am very impressed with how it works now! Well done!
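The file counts reported above (14, 27, and 54) are consistent with a simple estimate: the in-memory Arrow size divided by the target file size, rounded up. The sketch below is only a back-of-the-envelope check, not PyIceberg's actual bin-packing code. It assumes that `6.69 GB` means GiB, that 14 files corresponds to a target of `536870912` bytes (Iceberg's documented default for `write.target-file-size-bytes`), and the helper names `estimated_file_count` and `max_parallel_writes` are hypothetical.

```python
import math

GIB = 1024 ** 3

def estimated_file_count(table_bytes: int, target_file_size_bytes: int) -> int:
    """Rough estimate: one output file per target-sized chunk of Arrow data."""
    return math.ceil(table_bytes / target_file_size_bytes)

def max_parallel_writes(file_count: int, max_workers: int) -> int:
    """Upper bound on files written concurrently (assumes one task per file)."""
    return min(file_count, max_workers)

# Assumption: the reported 6.69 GB Arrow size is in GiB.
arrow_bytes = int(6.69 * GIB)

for target in (536870912, 268435456, 134217728):
    files = estimated_file_count(arrow_bytes, target)
    print(f"target={target}: ~{files} files, "
          f"at most {max_parallel_writes(files, 64)} concurrent writes with 64 workers")
```

Under these assumptions the estimate gives 14, 27, and 54 files. With only 14 files, at most 14 of 64 workers could be busy at once, which may be part of why raising `PYICEBERG_MAX_WORKERS` showed little effect. It does not explain why the 54-file run took the same 3.6 seconds.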