sungwy commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278456915
Hi @jkleinkauff , that's indeed an interesting observation. I have some follow-up questions to help us understand it better:

1. Where are your files stored?
2. Is there a way we can profile your IO and plot it against your IO download limit?

As a point of comparison, I just ran a scan using `to_arrow` against a table comprising 63 Parquet files of ~5.5 MB each; it returned in 6 seconds. I'd expect a table with fewer files to take less time to return (although the `limit` function here should ensure that we aren't even reading the Parquet files past the first one).

Your observation that limits of 1 through 100 took similar times makes sense to me as well: if you have 100+ MB files, you have to download the same amount of data regardless in order to return the limited result.
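To help gather the timing numbers mentioned above, here is a minimal sketch of a timing helper. The table and catalog names in the commented usage are hypothetical placeholders, not part of the original report:

```python
import time


def profile(fn, *args, **kwargs):
    """Run a callable, returning (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical usage against a PyIceberg table (names are assumptions):
#
#   tbl = catalog.load_table("db.my_table")
#   arrow_table, seconds = profile(lambda: tbl.scan(limit=1).to_arrow())
#   print(f"limit=1 scan took {seconds:.2f}s, {arrow_table.nbytes} bytes")
```

Comparing the elapsed time for `limit=1` vs `limit=100` against your network's download throughput should show whether the scan is IO-bound on large file downloads.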