jkleinkauff commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2281787859
Hi @kevinjqliu, thank you for your time! Here are my findings. I've also included a `read_parquet` benchmark using awswrangler; I don't know why, but it's by far the fastest method.

**Reading the raw Parquet file with awswrangler:** [read_raw_parquet_awswrangler.py](https://gist.github.com/jkleinkauff/acf9e876a72201c9530a0070dbb25e6e)
```
1st run: Time taken to read the Parquet file: 16.10819697380066 seconds
2nd run: Time taken to read the Parquet file: 16.06696915626526 seconds
3rd run: Time taken to read the Parquet file: 14.28455901145935 seconds
```

**Reading the raw Parquet file with pyarrow:** [read_raw_parquet_pyarrow.py](https://gist.github.com/jkleinkauff/a31538c06ee1d4622a03cbf98a369ce8)
```
1st run: Time taken to read the Parquet file: 39.86264896392822 seconds
2nd run: Time taken to read the Parquet file: 39.484612226486206 seconds
3rd run: Time taken to read the Parquet file: 26.693129062652588 seconds
```

**Reading the entire Iceberg table, without a limit:** [read_iceberg_full.py](https://gist.github.com/jkleinkauff/f828c9a282180417db94dbc98497691f)
```
1st run: Time taken to read the Iceberg table: 21.632921934127808 seconds
2nd run: Time taken to read the Iceberg table: 36.94430899620056 seconds
3rd run: Time taken to read the Iceberg table: 49.66138482093811 seconds
```

**Reading the Iceberg table with a limit of 1:** [read_iceberg_limit.py](https://gist.github.com/jkleinkauff/2e92add21bf84f18b5cb66d43aa7afe0)
```
1st run: Time taken to read the Iceberg table: 45.886711835861206 seconds
2nd run: Time taken to read the Iceberg table: 29.464744091033936 seconds
3rd run: Time taken to read the Iceberg table: 44.78428387641907 seconds
```

**Reading the Iceberg table with DuckDB:** [read_iceberg_duckdb.py](https://gist.github.com/jkleinkauff/9a48db4383516c13b6403c30468bd856)
```
1st run: Time taken to read the Parquet file: 59.5912652015686 seconds
2nd run: Time taken to read the Parquet file: 61.646626710891724 seconds
3rd run: Time taken to read the Parquet file: 58.97534728050232 seconds
```
Proxying through MotherDuck (`con = duckdb.connect("md:db")`):
```
1st run: Time taken to read the Parquet file: 105.63072204589844 seconds
2nd run: Time taken to read the Parquet file: 144.91437602043152 seconds
3rd run: Time taken to read the Parquet file: 176.27135396003723 seconds
```

**Reading the Iceberg table with DuckDB, with a limit of 1:** [read_iceberg_duckdb_limit.py](https://gist.github.com/jkleinkauff/04883f509a8df8a27c6c808bede0b6c1)
```
1st run: Time taken to read the Parquet file: 63.78661298751831 seconds
2nd run: Time taken to read the Parquet file: 79.1733546257019 seconds
3rd run: Time taken to read the Parquet file: 80.755441904068 seconds
```
Proxying through MotherDuck (why is it faster here? It seems to be pushing the limit down):
```
1st run: Time taken to read the Parquet file: 3.524472951889038 seconds
2nd run: Time taken to read the Parquet file: 3.4903008937835693 seconds
3rd run: Time taken to read the Parquet file: 3.258246898651123 seconds
```