Fokko commented on issue #30: URL: https://github.com/apache/iceberg-python/issues/30#issuecomment-1922085729
Hey @jwills, having a union dataset feels like a step in the right direction to me; however, I don't think it will really help when it comes to performance. Loading the files through PyArrow is very slow at the moment. The biggest issue there is that we aren't able to do the schema evolution in pure Arrow. That's why we materialize each file to a table, apply all the needed schema changes, and then concat all the tables at the end. This is very costly to do in Python. The main issue here is that Arrow does not support fetching schemas or filtering through [field-ids](https://github.com/apache/parquet-format/blob/3e33e0c0252cfaf0bc6821097ccaf1d7a3ce34e7/src/main/thrift/parquet.thrift#L459), which are the basis of Iceberg. A cleaner option would be to have the Arrow dataset expose a protocol that we can implement. This was suggested a while ago, but they were very reluctant about this and wanted to do everything through Substrait.