sungwy commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278456915
Hi @jkleinkauff , that's indeed an interesting observation. I have some follow-up questions to help us understand it better:

1. Where are your files stored?
2. Is there a way we can profile your IO and plot it against your IO download limit?

As a point of comparison, I just ran a scan using `to_arrow` against a table comprising 63 Parquet files of ~5.5 MB each; it returned in 6 seconds. I'd expect a table with fewer files to take less time to return (although the `limit` function here should ensure that we aren't even reading the Parquet files past the first one).

Your observation that limits of 1 through 100 took similar times makes sense to me as well: if you have 100+ MB files, you have to download the same amount of data regardless in order to return the limited result.
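To help gather the timing numbers mentioned above, here is a minimal sketch of a timing helper. The table and catalog names in the commented usage are hypothetical placeholders, not part of the original report:

```python
import time


def profile(fn, *args, **kwargs):
    """Run a callable, returning (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical usage against a PyIceberg table (names are assumptions):
#
#   tbl = catalog.load_table("db.my_table")
#   arrow_table, seconds = profile(lambda: tbl.scan(limit=1).to_arrow())
#   print(f"limit=1 scan took {seconds:.2f}s, {arrow_table.nbytes} bytes")
```

Comparing the elapsed time for `limit=1` vs `limit=100` against your network's download throughput should show whether the scan is IO-bound on large file downloads.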