jkleinkauff commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278564297
Hey, thank you for taking the time to answer me!

1. My files are in S3.
2. Sure! Is it something I could do on my end? Do you have any recommendation on that? (I'm not sure if it's the same thing, but running a download profiler (something I wrote using psutil) on one file, it takes about 25s to complete.)

Yeah, even with limit=1 it seems the scan is returning both files (just an observation, maybe it's intended):

```python
df = table.scan(limit=1)
# pa_table = df.to_arrow()
# Print the path of every data file the scan plans to read
for task in df.plan_files():
    print(task.file.file_path)
# s3://xxx/xxx/curitiba_starts_june/data/00000-0-6984da88-fe64-4765-9137-739072becfb1.parquet
# s3://xxx/xxx/curitiba_starts_june/data/00000-0-1de29b8f-2e8c-4543-9663-f769d53b17b7.parquet
```

Output of `table.inspect.manifests().to_pandas()`:

```python
❯ python pyiceberg_duckdb.py
   content                                               path  length  partition_spec_id  ...  added_delete_files_count  existing_delete_files_count  deleted_delete_files_count  partition_summaries
0        0  s3://data-lake-jho/bronze/curitiba_starts_june...   10433                  0  ...                         0                            0                           0                   []
1        0  s3://data-lake-jho/bronze/curitiba_starts_june...   10430                  0  ...                         0                            0                           0                   []

[2 rows x 12 columns]
```

I can also share the files or a direct link to them.

Thank you!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org