kevinjqliu closed issue #1032: Performance question for to_arrow, to_pandas,
to_duckdb
URL: https://github.com/apache/iceberg-python/issues/1032
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the sp
kevinjqliu commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2287282201
Thanks for reporting this. I learned a lot from exploring this thread, and
we have some solid improvements coming up. Please let us know if anything else
comes up!
jkleinkauff commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2286166233
@sungwy @kevinjqliu I'm really enjoying this discussion and learning a ton
from it. Would love to keep it going but feel free to close it as this is not
an issue. Thank y
kevinjqliu commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2285052713
> you're benchmarking the fsspec FileIO path in pyiceberg, which if I
understand correctly is using fsspec s3fs directly with a lot of defaults.
Probably it keeps the defa
kevinjqliu commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2285046586
> I have one more question regarding the read_parquet from awswrangler.
Do you know why it's faster than the other methods? Is there any
optimization on their end or som
jkleinkauff commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2284442737
@kevinjqliu that's awesome! Thank you so much!
I have one more question regarding the **read_parquet** from awswrangler.
Do you know why it's faster than the other
kevinjqliu commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2282819711
Thanks for looking into the different scenarios. It looks like there are
varying results depending on the engines.
### Read Path
I took a deeper look into the rea
jkleinkauff commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2281787859
Hi @kevinjqliu thank you for your time!
Those are my findings:
I've included a read_parquet method from awswrangler. Don't know why, but
it's by far the fast
kevinjqliu commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278578863
okay, this doesn't look like an issue with reading many metadata files.
I wonder if the `limit` is respected for table scans.
Things I want to compare
* readin
jkleinkauff commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278564297
Hey, thank you for taking the time to answer me!
1. My files are in S3.
2. Sure! Is it something I could do on my end? Do you have any recommendations
on that?
(I
kevinjqliu commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278556792
There's a nontrivial cost in reading metadata files in Iceberg.
Can you run this command,
```
table.inspect.manifests().to_pandas()
```
This will show the nu
sungwy commented on issue #1032:
URL:
https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278456915
Hi @jkleinkauff , that's indeed an interesting observation.
I have some follow up questions to help us understand it better.
1. Where are your files stored?
2. Is t
jkleinkauff opened a new issue, #1032:
URL: https://github.com/apache/iceberg-python/issues/1032
### Question
Hey, thanks for this very convenient library.
This is not a bug, just want to better understand something.
I have a question regarding the performance - ie time t