Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-13 Thread via GitHub
kevinjqliu closed issue #1032: Peformance question for to_arrow, to_pandas, to_duckdb URL: https://github.com/apache/iceberg-python/issues/1032

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-13 Thread via GitHub
kevinjqliu commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2287282201 Thanks for reporting this. I learned a lot from exploring this thread, and we have some solid improvements coming up. Please let us know if anything else comes up!

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-13 Thread via GitHub
jkleinkauff commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2286166233 @sungwy @kevinjqliu I'm really enjoying this discussion and learning a ton from it. Would love to keep it going but feel free to close it as this is not an issue. Thank y

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-12 Thread via GitHub
kevinjqliu commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2285052713 > you're benchmarking the fsspec FileIO path in pyiceberg, which if I understand correctly is using fsspec s3fs directly with a lot of defaults. Probably it keeps the defa

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-12 Thread via GitHub
kevinjqliu commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2285046586 > I have one more question regarding the read_parquet from awswrangler. Do you know why it's faster than the other methods? Is there any optimization on their end or som

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-12 Thread via GitHub
jkleinkauff commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2284442737 @kevinjqliu that's awesome! Thank you so much! I have one more question regarding the **read_parquet** from awswrangler. Do you know why it's faster than the other

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-11 Thread via GitHub
kevinjqliu commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2282819711 Thanks for looking into the different scenarios. It looks like there are varying results depending on the engines. ### Read Path I took a deeper look into the rea

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-10 Thread via GitHub
jkleinkauff commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2281787859 Hi @kevinjqliu, thank you for your time! These are my findings: I've included a read_parquet method from awswrangler. I don't know why, but it's by far the fast

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-09 Thread via GitHub
kevinjqliu commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278578863 Okay, this doesn't look like an issue with reading many metadata files. I wonder if the `limit` is respected for table scans. Things I want to compare * readin

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-09 Thread via GitHub
jkleinkauff commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278564297 Hey, thank you for taking the time to answer me! 1. My files are in S3. 2. Sure! Is it something I could do on my end? Do you have any recommendation on that? (I

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-09 Thread via GitHub
kevinjqliu commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278556792 There's a nontrivial cost in reading metadata files in Iceberg. Can you run this command,
```
table.inspect.manifests().to_pandas()
```
This will show the nu
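A minimal sketch of running the manifest inspection from the comment above while also measuring how long it takes. It assumes `table` is an already-loaded pyiceberg table and that pandas is installed; the helper name `inspect_manifests_timed` is hypothetical and not part of pyiceberg.

```python
import time
from typing import Any, Tuple


def inspect_manifests_timed(table: Any) -> Tuple[Any, int, float]:
    """Run the inspection call from the thread and time it.

    `table` is assumed to be a loaded pyiceberg table. Returns the
    resulting DataFrame, its row count, and the elapsed seconds.
    """
    start = time.perf_counter()
    # Same call as suggested in the thread.
    manifests_df = table.inspect.manifests().to_pandas()
    elapsed = time.perf_counter() - start
    return manifests_df, len(manifests_df), elapsed
```

A call such as `df, n_rows, seconds = inspect_manifests_timed(table)` gives the inspection result alongside a rough wall-clock cost, which can then be compared against the time a scan takes.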

Re: [I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-09 Thread via GitHub
sungwy commented on issue #1032: URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278456915 Hi @jkleinkauff, that's indeed an interesting observation. I have some follow-up questions to help us understand it better. 1. Where are your files stored? 2. Is t

[I] Peformance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

2024-08-09 Thread via GitHub
jkleinkauff opened a new issue, #1032: URL: https://github.com/apache/iceberg-python/issues/1032 ### Question Hey, thanks for this very convenient library. This is not a bug; I just want to better understand something. I have a question regarding the performance, i.e., time t
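One way to compare the three output methods named in the issue title is to time each on a fresh scan. The sketch below assumes `table` is an already-loaded pyiceberg table and that pandas and duckdb are installed; the helper name `time_scan_outputs` and the DuckDB table name `"bench"` are hypothetical and chosen only for illustration.

```python
import time
from typing import Any, Dict, Optional


def time_scan_outputs(table: Any, limit: Optional[int] = None) -> Dict[str, float]:
    """Time to_arrow, to_pandas, and to_duckdb on fresh scans of `table`.

    `table` is assumed to be a loaded pyiceberg table. Returns the
    wall-clock seconds spent in each output method.
    """
    # Each entry builds its own scan so no result is shared between runs.
    readers = {
        "to_arrow": lambda: table.scan(limit=limit).to_arrow(),
        "to_pandas": lambda: table.scan(limit=limit).to_pandas(),
        # "bench" is an arbitrary, hypothetical DuckDB table name.
        "to_duckdb": lambda: table.scan(limit=limit).to_duckdb("bench"),
    }
    timings: Dict[str, float] = {}
    for name, read in readers.items():
        start = time.perf_counter()
        read()
        timings[name] = time.perf_counter() - start
    return timings
```

Calling `time_scan_outputs(table, limit=100)` returns a dictionary keyed by method name. Single runs are noisy, especially against remote storage such as S3, so repeating the measurement a few times gives a fairer comparison.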