Re: [I] Expose PyIceberg table as PyArrow Dataset [iceberg-python]

via GitHub Thu, 01 Feb 2024 11:35:43 -0800


Fokko commented on issue #30:
URL: https://github.com/apache/iceberg-python/issues/30#issuecomment-1922085729


   Hey @jwills, having a union dataset feels like a step in the right direction 
to me; however, I don't think it will really help when it comes to performance.
   
   Loading the files through PyArrow is very slow at the moment. The biggest 
issue there is that we aren't able to do the schema evolution in pure Arrow. 
That's why we materialize to a table, apply all the needed changes to the 
schema, and then concat all the tables at the end. This is very costly to do in 
Python. The main issue here is that Arrow does not support fetching schemas 
or filtering by 
[field-ids](https://github.com/apache/parquet-format/blob/3e33e0c0252cfaf0bc6821097ccaf1d7a3ce34e7/src/main/thrift/parquet.thrift#L459),
 which are the basis of Iceberg.
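   
   To make the field-id point concrete, here is a minimal, library-free sketch 
of the "materialize, fix the schema, then concat" flow. It is not the PyIceberg 
implementation: the dict-based file layout and the helper names 
`project_by_field_id` and `concat` are assumptions made up for illustration. 
The idea is that columns are matched on field id rather than on name, so a 
renamed column still resolves and a column added later is filled with nulls.

```python
# Hypothetical illustration only: each "file" is a dict with a schema
# (field id -> column name at write time) and its column data.

def project_by_field_id(file, table_schema):
    """Rewrite one file's columns to the current table schema, matching on field id."""
    num_rows = len(next(iter(file["columns"].values()), []))
    projected = {}
    for field_id, current_name in table_schema.items():
        old_name = file["schema"].get(field_id)
        if old_name is None:
            # Column was added after this file was written: fill with nulls.
            projected[current_name] = [None] * num_rows
        else:
            # Column may have been renamed; the field id still identifies it.
            projected[current_name] = list(file["columns"][old_name])
    return projected


def concat(projected_files, table_schema):
    """Append the already-projected files column by column."""
    out = {name: [] for name in table_schema.values()}
    for columns in projected_files:
        for name in out:
            out[name].extend(columns[name])
    return out


# Current table schema: field id -> current column name.
table_schema = {1: "id", 2: "full_name", 3: "country"}

# Written before field 2 was renamed and before field 3 was added.
old_file = {
    "schema": {1: "id", 2: "name"},
    "columns": {"id": [1, 2], "name": ["a", "b"]},
}
# Written with the current schema.
new_file = {
    "schema": {1: "id", 2: "full_name", 3: "country"},
    "columns": {"id": [3], "full_name": ["c"], "country": ["NL"]},
}

result = concat(
    [project_by_field_id(f, table_schema) for f in (old_file, new_file)],
    table_schema,
)
print(result)
```

   Every file gets rewritten in Python before the final concat, which is where 
the cost mentioned above comes from.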
   
   A cleaner option would be to have the Arrow dataset expose a protocol that 
we can implement. This was suggested a while ago, but they were very reluctant 
about this and wanted to do everything through Substrait.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


