Re: [I] Performance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

via GitHub Mon, 12 Aug 2024 16:07:39 -0700


kevinjqliu commented on issue #1032:
URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2285046586


   > I have one more question regarding the read_parquet from awswrangler.
   > Do you know why it's faster than the other methods? Is there any
   > optimization on their end or something?
   
   I was also surprised by the performance difference. It's hard to say why
   without looking into the implementation details (in
   [awswrangler/s3/_read_parquet.py](https://github.com/aws/aws-sdk-pandas/blob/8d0c071649fb9e603a2ab2846307f902fafeabf5/awswrangler/s3/_read_parquet.py#L318)).
   There's definitely room for optimization on the PyIceberg side.
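
To compare the read paths on equal footing, a small timing harness can help. The following is only a minimal sketch: `time_reads` is a hypothetical helper (not part of PyIceberg or awswrangler), and the stand-in readers exist only so the snippet runs without a catalog or S3 access. In practice you would pass callables such as `lambda: table.scan().to_arrow()` or `lambda: wr.s3.read_parquet(path=s3_path)`, assuming `table` is an already-loaded PyIceberg table and `s3_path` points at the same data.

```python
import time
from typing import Callable, Dict


def time_reads(readers: Dict[str, Callable[[], object]], repeats: int = 3) -> Dict[str, float]:
    """Return the best wall-clock time (seconds) observed for each named reader."""
    results: Dict[str, float] = {}
    for name, read in readers.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            read()  # the result is discarded; only the elapsed time matters here
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


# Stand-in readers so the sketch runs anywhere. Replace them with real calls,
# e.g. lambda: table.scan().to_arrow() (assumes `table` is a loaded PyIceberg table).
timings = time_reads({
    "fast": lambda: sum(range(1_000)),
    "slow": lambda: time.sleep(0.02),
})

# Print the readers from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f}s")
```

Taking the best of several runs reduces noise from cold caches and network jitter, though for S3 reads it is worth also looking at the first (cold) run separately.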
   
   Another engine worth comparing against is
   [daft](https://www.getdaft.io/projects/docs/en/latest/user_guide/basic_concepts/read-and-write.html#from-files),
   which is optimized for reading Parquet on S3. Its read performance is a good
   target for potential gains in PyIceberg.
   
   On the PyIceberg side, there's a future opportunity to integrate with 
iceberg-rust, which might speed up reading files. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

