Re: [I] Expose PyIceberg table as PyArrow Dataset [iceberg-python]

via GitHub Thu, 31 Oct 2024 12:56:30 -0700


corleyma commented on issue #30:
URL: https://github.com/apache/iceberg-python/issues/30#issuecomment-2450705382


   @kevinjqliu alas it's not as simple for iceberg because of the need to do 
field id-based projection to handle schema evolution.  
   
   Somewhat relatedly: from what I remember, and assuming nothing has changed, 
PyArrow Datasets can't have fragments with different input schema.  PyArrow 
dataset assumes a dataset has a single schema and that it matches the incoming 
schema of the fragments, so e.g. you can't specify per-fragment projections.   
   
   I know @kevinjqliu understands this already, but for others who may want to 
pick up work on this in the future, here's what this means in practice:
   Say you're reading an iceberg table snapshot, and the schema of that 
snapshot should be the schema of the PyArrow dataset (the "output schema" of 
the dataset scan).  Since schema changes in Iceberg are metadata-only 
operations, it's possible that your table has e.g. historical partitions 
written with an older schema that have not been re-written.  An 
Iceberg-compliant reader is supposed to project these old files to the new 
schema at readtime.  That means, in PyArrow Dataset terminology, you need 
fragments with potentially different schemas to be projected to the final 
schema of the dataset.  And the set of things this has to account for includes 
e.g. column renames (would be handled by field id-based projection if pyarrow 
supported it), but also e.g. type widening (so e.g., old fragments have column 
foo stored as an int, new fragments store it as long after a schema update; 
pyarrow dataset needs to cast that column to long when reading the old 
fragments).
   
   There was a discussion long ago about creating a [PyArrow Dataset 
protocol](https://github.com/apache/arrow/issues/37504), so that third parties 
could implement scanners with their own logic around an interface that accepts 
projections and filters expressed as substrait predicates, and returns one or 
more streams of arrow data.  Libraries that consume PyArrow datasets would be 
able to consume these other implementations using the PyCapsule protocol 
without needing any special integration logic.
   
   It hasn't been adopted, but it would indeed allow pyiceberg to define a 
dataset that does whatever it wants.  However, unless pyiceberg lifted much of 
the underlying implementation into iceberg-rust, a pure-python dataset 
implementation might not be able to achieve the performance folks want.  In a 
world where pyiceberg continues to rely mostly on pyarrow for native C++ 
implementations of performance-critical logic, pyiceberg would probably benefit 
most from extensions to the dataset implementation in PyArrow allowing folks to 
specify per-fragment projections to the final dataset output schema.  That way 
all the actual concurrent scanning logic could be farmed out to pyarrow's 
native implementations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Expose PyIceberg table as PyArrow Dataset [iceberg-python]

Reply via email to