jkleinkauff commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2281787859
Hi @kevinjqliu, thank you for your time! Here are my findings. I've also included a `read_parquet` benchmark using awswrangler; I don't know why, but it's by far the fastest method.

**Reading the raw Parquet file with awswrangler:** [read_raw_parquet_awswrangler.py](https://gist.github.com/jkleinkauff/acf9e876a72201c9530a0070dbb25e6e)
```
1st run: Time taken to read the Parquet file: 16.10819697380066 seconds
2nd run: Time taken to read the Parquet file: 16.06696915626526 seconds
3rd run: Time taken to read the Parquet file: 14.28455901145935 seconds
```

**Reading the raw Parquet file with pyarrow:** [read_raw_parquet_pyarrow.py](https://gist.github.com/jkleinkauff/a31538c06ee1d4622a03cbf98a369ce8)
```
1st run: Time taken to read the Parquet file: 39.86264896392822 seconds
2nd run: Time taken to read the Parquet file: 39.484612226486206 seconds
3rd run: Time taken to read the Parquet file: 26.693129062652588 seconds
```

**Reading the entire Iceberg table, without a limit:** [read_iceberg_full.py](https://gist.github.com/jkleinkauff/f828c9a282180417db94dbc98497691f)
```
1st run: Time taken to read the Iceberg table: 21.632921934127808 seconds
2nd run: Time taken to read the Iceberg table: 36.94430899620056 seconds
3rd run: Time taken to read the Iceberg table: 49.66138482093811 seconds
```

**Reading the Iceberg table with a limit of 1:** [read_iceberg_limit.py](https://gist.github.com/jkleinkauff/2e92add21bf84f18b5cb66d43aa7afe0)
```
1st run: Time taken to read the Iceberg table: 45.886711835861206 seconds
2nd run: Time taken to read the Iceberg table: 29.464744091033936 seconds
3rd run: Time taken to read the Iceberg table: 44.78428387641907 seconds
```

**Reading the Iceberg table with DuckDB:** [read_iceberg_duckdb.py](https://gist.github.com/jkleinkauff/9a48db4383516c13b6403c30468bd856)
```
1st run: Time taken to read the Parquet file: 59.5912652015686 seconds
2nd run: Time taken to read the Parquet file: 61.646626710891724 seconds
3rd run: Time taken to read the Parquet file: 58.97534728050232 seconds
```
Proxying through MotherDuck (`con = duckdb.connect("md:db")`):
```
1st run: Time taken to read the Parquet file: 105.63072204589844 seconds
2nd run: Time taken to read the Parquet file: 144.91437602043152 seconds
3rd run: Time taken to read the Parquet file: 176.27135396003723 seconds
```

**Reading the Iceberg table with DuckDB, with a limit of 1:** [read_iceberg_duckdb_limit.py](https://gist.github.com/jkleinkauff/04883f509a8df8a27c6c808bede0b6c1)
```
1st run: Time taken to read the Parquet file: 63.78661298751831 seconds
2nd run: Time taken to read the Parquet file: 79.1733546257019 seconds
3rd run: Time taken to read the Parquet file: 80.755441904068 seconds
```
Proxying through MotherDuck (why is it faster here? It seems to be pushing the limit down):
```
1st run: Time taken to read the Parquet file: 3.524472951889038 seconds
2nd run: Time taken to read the Parquet file: 3.4903008937835693 seconds
3rd run: Time taken to read the Parquet file: 3.258246898651123 seconds
```