Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

via GitHub Fri, 09 Aug 2024 12:01:15 -0700


jkleinkauff commented on issue #1032:
URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278564297


   Hey, thank you for taking the time to answer!
   
   1. My files are in S3.
   2. Sure! Is that something I could do on my end? Do you have any
   recommendations?
   (I'm not sure it's the same thing, but running a download profiler I wrote
   with psutil on one file, it takes around 25s to complete.)
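
For reference, a raw download timing along these lines can be sketched as below. This is a minimal sketch and not the psutil-based profiler mentioned above: it uses `time.perf_counter` instead, and the `time_download` helper and its `chunk_size` parameter are hypothetical names introduced only for illustration.

```python
import io
import time
from typing import BinaryIO, Callable


def time_download(
    open_file: Callable[[], BinaryIO], chunk_size: int = 8 * 1024 * 1024
) -> tuple[int, float]:
    """Read a file to the end and return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    with open_file() as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            total += len(chunk)
    return total, time.perf_counter() - start


# Local in-memory stand-in so the sketch runs without S3 access.
n_bytes, seconds = time_download(lambda: io.BytesIO(b"x" * 1_000_000))
print(f"{n_bytes} bytes in {seconds:.4f}s")
```

Against S3, the same helper could presumably be given something like `lambda: pyarrow.fs.S3FileSystem().open_input_file("bucket/key")`, assuming credentials and region are already configured in the environment. Comparing that number with the `to_arrow()` time would show how much of the total is plain download.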
   
   Yeah, even with limit=1 it seems scan is returning both files (just an 
observation, maybe it's intended):
   
   ```python
   # `table` is the already-loaded pyiceberg Table
   df = table.scan(limit=1)
   # pa_table = df.to_arrow()
   for task in df.plan_files():
       print(task.file.file_path)
   # s3://xxx/xxx/curitiba_starts_june/data/00000-0-6984da88-fe64-4765-9137-739072becfb1.parquet
   # s3://xxx/xxx/curitiba_starts_june/data/00000-0-1de29b8f-2e8c-4543-9663-f769d53b17b7.parquet
   ```
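
To see how much data each planned file actually holds, the file metadata on each task can be listed as well. The following is a minimal sketch, assuming the `record_count` and `file_size_in_bytes` fields that pyiceberg exposes on a task's `file`; the `summarize_tasks` helper and the stand-in values are hypothetical and only there so the example runs without a catalog. As far as I understand, `limit` is applied when rows are read rather than during planning, which may be why both files are listed.

```python
from types import SimpleNamespace


def summarize_tasks(tasks):
    """Return (file_path, record_count, file_size_in_bytes) for each planned file."""
    return [
        (t.file.file_path, t.file.record_count, t.file.file_size_in_bytes)
        for t in tasks
    ]


# Stand-in tasks with made-up values, for illustration only. With a real
# table this would be: summarize_tasks(table.scan(limit=1).plan_files())
fake_tasks = [
    SimpleNamespace(
        file=SimpleNamespace(
            file_path="s3://bucket/a.parquet",
            record_count=100,
            file_size_in_bytes=2048,
        )
    ),
    SimpleNamespace(
        file=SimpleNamespace(
            file_path="s3://bucket/b.parquet",
            record_count=50,
            file_size_in_bytes=1024,
        )
    ),
]

for path, rows, size in summarize_tasks(fake_tasks):
    print(path, rows, size)
```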
   
   Output of `table.inspect.manifests().to_pandas()`:
   
   ```text
   ❯ python pyiceberg_duckdb.py
      content                                               path  length  partition_spec_id  ...  added_delete_files_count  existing_delete_files_count  deleted_delete_files_count  partition_summaries
   0        0  s3://data-lake-jho/bronze/curitiba_starts_june...   10433                  0  ...                         0                            0                           0                   []
   1        0  s3://data-lake-jho/bronze/curitiba_starts_june...   10430                  0  ...                         0                            0                           0                   []

   [2 rows x 12 columns]
   ```
   I can also share the files themselves or a direct link to them. Thank you!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


