bigluck commented on issue #428:
URL: https://github.com/apache/iceberg-python/issues/428#issuecomment-1963740242

    @kevinjqliu, your latest changes are mind-blowing 
(https://github.com/apache/iceberg-python/issues/428#issuecomment-1962460623 
for reference)
   
   I have tested your latest changes on `c5ad.8xlarge` and `c5ad.16xlarge` 
instances using my `10,000,000`-row table.
   - On a `c5ad.8xlarge` instance with a `10Gbps` NIC, `32 cores`, and 
`64 GiB` of RAM, it took an average of `3.9s` to write `14` Parquet files. 
Previously, using `pynessie 0.6.0`, it took `31s`.
   - On a `c5ad.16xlarge` instance with a `20Gbps` NIC, `64 cores`, and 
`128 GiB` of RAM, it took approximately `3.6s` to complete the same task, 
compared to `28.2s` using `pynessie 0.6.0`.
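
   For reference, the speedup implied by those timings can be worked out with a 
few lines of Python. This is only arithmetic on the numbers quoted above; the 
`speedup` helper is a hypothetical name used for illustration.

```python
# Timings quoted in the comment above, in seconds: (before, after).
timings = {
    "c5ad.8xlarge": (31.0, 3.9),
    "c5ad.16xlarge": (28.2, 3.6),
}


def speedup(before: float, after: float) -> float:
    """Return how many times faster the new write is (hypothetical helper)."""
    return before / after


speedups = {name: speedup(b, a) for name, (b, a) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
```

   Both instance sizes land at roughly the same factor, a little under 8x.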
   
   I have been experimenting with different settings to improve write 
performance further, but without success.
   I tried adjusting the `PYICEBERG_MAX_WORKERS` variable, but it made little 
difference. This might be due to the small size of my dataset (only `6.69 
GB` in Arrow format), which resulted in only 14 output files.
   I also tested the `write.target-file-size-bytes` property, which produced 27 
files when set to `268435456` and 54 files when set to `134217728`.
   However, even when I set `PYICEBERG_MAX_WORKERS` to 64, the total write 
operation still took 3.6 seconds.
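
   The file counts above are consistent with a simple ceiling division of the 
in-memory Arrow size by the target file size. The sketch below rests on two 
assumptions that are not stated in the comment: that `6.69 GB` means GiB 
(1024^3 bytes), and that the run that produced 14 files used a target of 
`536870912` bytes (512 MiB). The real writer may split data differently, so 
treat `estimated_file_count` as a hypothetical approximation, not the 
library's logic.

```python
import math

GIB = 1024 ** 3

# Assumption: the "6.69 GB" Arrow size quoted above is in GiB.
arrow_bytes = int(6.69 * GIB)


def estimated_file_count(total_bytes: int, target_file_size_bytes: int) -> int:
    """Approximate the number of output files by ceiling division."""
    return math.ceil(total_bytes / target_file_size_bytes)


# 536870912 (512 MiB) is an assumed target; the other two values are the
# `write.target-file-size-bytes` settings mentioned in the comment.
targets = [536870912, 268435456, 134217728]
counts = {t: estimated_file_count(arrow_bytes, t) for t in targets}

for target, n in counts.items():
    print(f"target={target} bytes -> about {n} files")
```

   Under those assumptions the estimate gives 14, 27, and 54 files, matching 
the counts reported above.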
   
   Overall, I am very impressed with how it works now! Well done!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

