kevinjqliu commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278578863
okay, this doesn't look like an issue with reading many metadata files. I wonder if the `limit` is respected for table scans. Things I want to compare:

* reading the raw parquet file with pyarrow
* reading the entire iceberg table, without a limit
* reading the iceberg table, with a limit of 1
* reading the iceberg table with duckdb
* reading the iceberg table with duckdb, with a limit of 1

I think this will give us some insight into read performance in pyiceberg. Sketches for the pyiceberg and duckdb reads follow the pyarrow example below.

For reading the raw parquet file, you can do something like this:

```
import time

import pyarrow.parquet as pq

# Path to one of the table's underlying data files
parquet_file_path = ""

start_time = time.time()
table = pq.read_table(parquet_file_path)
end_time = time.time()

time_taken = end_time - start_time
print(f"Time taken to read the Parquet file: {time_taken} seconds")
```
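For the two pyiceberg reads, a minimal sketch along these lines should work. The catalog name `default` and the table identifier `db.events` are placeholders for whatever your setup uses:

```
import time

from pyiceberg.catalog import load_catalog

# Placeholder catalog and table names -- substitute your own
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Full table scan, no limit
start_time = time.time()
full = table.scan().to_arrow()
print(f"Full scan: {time.time() - start_time} seconds, {len(full)} rows")

# Same scan with limit=1
start_time = time.time()
first = table.scan(limit=1).to_arrow()
print(f"Scan with limit=1: {time.time() - start_time} seconds, {len(first)} rows")
```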
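For the duckdb comparison, one option is duckdb's `iceberg` extension, roughly as sketched below. This assumes the extension can be installed in your environment, and the metadata path is a placeholder pointing at the table's current metadata file:

```
import time

import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")

# Placeholder path to the table's metadata file -- substitute your own
metadata_path = "s3://bucket/warehouse/db/events/metadata/v1.metadata.json"

# Full read through duckdb
start_time = time.time()
rows = con.execute(f"SELECT * FROM iceberg_scan('{metadata_path}')").fetchall()
print(f"duckdb full read: {time.time() - start_time} seconds, {len(rows)} rows")

# Same query with LIMIT 1
start_time = time.time()
row = con.execute(f"SELECT * FROM iceberg_scan('{metadata_path}') LIMIT 1").fetchall()
print(f"duckdb LIMIT 1: {time.time() - start_time} seconds")
```

Comparing these five timings should tell us whether the slowdown is in pyiceberg's scan planning, in the `limit` handling, or in the parquet read itself.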