corleyma commented on issue #30: URL: https://github.com/apache/iceberg-python/issues/30#issuecomment-2450705382

@kevinjqliu alas, it's not as simple for Iceberg because of the need to do field-id-based projection to handle schema evolution.

Somewhat relatedly: from what I remember, and assuming nothing has changed, PyArrow Datasets can't have fragments with different input schemas. A PyArrow dataset assumes a single schema that matches the incoming schema of its fragments, so you can't specify per-fragment projections.

I know @kevinjqliu understands this already, but for others who may want to pick up this work in the future, here's what it means in practice. Say you're reading an Iceberg table snapshot; the schema of that snapshot should be the schema of the PyArrow dataset (the "output schema" of the dataset scan). Since schema changes in Iceberg are metadata-only operations, your table may have, e.g., historical partitions written with an older schema that have never been rewritten. An Iceberg-compliant reader is supposed to project those old files to the new schema at read time. In PyArrow Dataset terminology, that means fragments with potentially different schemas all need to be projected to the final schema of the dataset. The set of things this has to account for includes column renames (which field-id-based projection would handle, if PyArrow supported it), but also type widening (e.g., old fragments store column `foo` as an int, newer fragments store it as a long after a schema update, and the reader needs to cast that column to long when reading the old fragments). There's a rough sketch of this per-fragment projection at the end of this comment.

There was a discussion long ago about creating a [PyArrow Dataset protocol](https://github.com/apache/arrow/issues/37504), so that third parties could implement scanners with their own logic behind an interface that accepts projections and filters expressed as Substrait expressions and returns one or more streams of Arrow data. Libraries that consume PyArrow datasets would then be able to consume these other implementations through the PyCapsule protocol without any special integration logic. The protocol hasn't been adopted, but it would indeed allow pyiceberg to define a dataset that does whatever it wants.

However, unless pyiceberg lifted much of the underlying implementation into iceberg-rust, a pure-Python dataset implementation might not achieve the performance folks want. In a world where pyiceberg continues to rely mostly on PyArrow for native C++ implementations of performance-critical logic, pyiceberg would probably benefit most from extensions to PyArrow's dataset implementation that allow per-fragment projections to the final dataset output schema. That way, all of the actual concurrent scanning logic could be farmed out to PyArrow's native implementations.
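For concreteness, here's a minimal sketch of the per-fragment projection described above. It's illustrative only, not pyiceberg's actual implementation: the target schema and the field-id-to-old-name mapping are made up, and a real reader would pull the field ids from Iceberg table metadata and from the Parquet files themselves.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Current snapshot schema, i.e. the desired output schema of the whole scan.
# The field_id metadata here is hard-coded for illustration.
target_schema = pa.schema([
    pa.field("id", pa.int64(), metadata={"field_id": "1"}),       # old files wrote this as int32
    pa.field("payload", pa.string(), metadata={"field_id": "2"}),  # old files named this column "data"
])

# Hypothetical mapping of Iceberg field id -> column name as written in an old data file.
old_names_by_field_id = {1: "id", 2: "data"}


def project_fragment(path: str) -> pa.Table:
    """Read one old data file and project it to the current snapshot schema:
    resolve columns by field id (handles renames) and cast widened types
    (e.g. int32 -> int64)."""
    table = pq.read_table(path)
    columns = []
    for field in target_schema:
        field_id = int(field.metadata[b"field_id"])
        old_name = old_names_by_field_id.get(field_id)
        if old_name is not None and old_name in table.column_names:
            # Select by the name used when the file was written, cast to the
            # current type, and emit under the current name.
            columns.append(table.column(old_name).cast(field.type))
        else:
            # Column added to the table after this file was written: fill with nulls.
            columns.append(pa.nulls(len(table), type=field.type))
    return pa.Table.from_arrays(columns, schema=target_schema)
```

This is exactly the per-file work that can't currently be expressed as part of a single PyArrow dataset scan, which is why it has to happen outside the dataset machinery, fragment by fragment.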
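On the PyCapsule point: even without the dataset protocol, a scan can already be handed to other libraries as a stream of Arrow data, because pyarrow.RecordBatchReader implements `__arrow_c_stream__` in recent pyarrow versions. Here's a hypothetical sketch; the `IcebergScan` class is invented for illustration and just reuses `project_fragment` from the previous snippet:

```python
from typing import Iterator, List

import pyarrow as pa


class IcebergScan:
    """Hypothetical wrapper that exposes a table scan as an Arrow C stream, so
    any consumer of the PyCapsule stream interface can read it without
    pyiceberg-specific integration code."""

    def __init__(self, schema: pa.Schema, data_files: List[str]):
        self.schema = schema          # snapshot schema == output schema of the scan
        self.data_files = data_files  # e.g. data file paths from a snapshot scan

    def _batches(self) -> Iterator[pa.RecordBatch]:
        for path in self.data_files:
            # project_fragment() is the per-fragment projection sketched above;
            # every batch it yields is assumed to already match self.schema.
            yield from project_fragment(path).to_batches()

    def to_reader(self) -> pa.RecordBatchReader:
        return pa.RecordBatchReader.from_batches(self.schema, self._batches())

    def __arrow_c_stream__(self, requested_schema=None):
        # Delegate the PyCapsule export to RecordBatchReader.
        return self.to_reader().__arrow_c_stream__(requested_schema)
```

A consumer could then build a reader with `pa.RecordBatchReader.from_stream(scan)` (recent pyarrow) or hand the object to anything else that accepts the Arrow stream interface. What this doesn't give you is pushdown: the dataset protocol discussion was about also passing projections and filters to the producer, and none of the scanning in this sketch runs in native code, which is the performance concern above.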
@kevinjqliu alas it's not as simple for iceberg because of the need to do field id-based projection to handle schema evolution. Somewhat relatedly: from what I remember, and assuming nothing has changed, PyArrow Datasets can't have fragments with different input schema. PyArrow dataset assumes a dataset has a single schema and that it matches the incoming schema of the fragments, so e.g. you can't specify per-fragment projections. I know @kevinjqliu understands this already, but for others who may want to pick up work on this in the future, here's what this means in practice: Say you're reading an iceberg table snapshot, and the schema of that snapshot should be the schema of the PyArrow dataset (the "output schema" of the dataset scan). Since schema changes in Iceberg are metadata-only operations, it's possible that your table has e.g. historical partitions written with an older schema that have not been re-written. An Iceberg-compliant reader is supposed to project these old files to the new schema at readtime. That means, in PyArrow Dataset terminology, you need fragments with potentially different schemas to be projected to the final schema of the dataset. And the set of things this has to account for includes e.g. column renames (would be handled by field id-based projection if pyarrow supported it), but also e.g. type widening (so e.g., old fragments have column foo stored as an int, new fragments store it as long after a schema update; pyarrow dataset needs to cast that column to long when reading the old fragments). There was a discussion long ago about creating a [PyArrow Dataset protocol](https://github.com/apache/arrow/issues/37504), so that third parties could implement scanners with their own logic around an interface that accepts projections and filters expressed as substrait predicates, and returns one or more streams of arrow data. Libraries that consume PyArrow datasets would be able to consume these other implementations using the PyCapsule protocol without needing any special integration logic. It hasn't been adopted, but it would indeed allow pyiceberg to define a dataset that does whatever it wants. However, unless pyiceberg lifted much of the underlying implementation into iceberg-rust, a pure-python dataset implementation might not be able to achieve the performance folks want. In a world where pyiceberg continues to rely mostly on pyarrow for native C++ implementations of performance-critical logic, pyiceberg would probably benefit most from extensions to the dataset implementation in PyArrow allowing folks to specify per-fragment projections to the final dataset output schema. That way all the actual concurrent scanning logic could be farmed out to pyarrow's native implementations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org