[PR] feat: Safer PartitionSpec & SchemalessPartitionSpec [iceberg-rust]

via GitHub Mon, 23 Sep 2024 13:24:54 -0700


c-thiel opened a new pull request, #645:
URL: https://github.com/apache/iceberg-rust/pull/645

Fixes https://github.com/apache/iceberg-rust/issues/550

This PR is a result of the issue mentioned above, a [Slack Discussion](
https://apache-iceberg.slack.com/archives/C03LG1D563F/p1725878870301909) and
the preparation for the new MetadataBuilder
https://github.com/apache/iceberg-rust/pull/587.

Lets first establish, that `PartitionSpec` should have a bound schema
because:
* We need to bind the schema to to `PartitonSpec` to allow internal
partition id handling (see @rdblue comment in slack).
* It allows for consistent handling of schema related operations, such as
determining the partition type without returning a `Result`.
* A PartitionSpec is only valid for the schema is had been build against,
and we should not imply otherwise.

If we agree on this, then we have a small Problem with `TableMetadata`: It
contains historic PartitionSpecs that cannot be bound against the
`current_schema`. As a result, we need a type that has the same properties as
`PartitionSpec` but is not bound to a schema. I thought we could use
`UnboundPartitionSpec` for this at first, but it serves a different purpose and
would make the very important `field_id` Optional.

To solve this, I introduce a `SchemalessPartitionSpec`. Common functions
between `PartitionSpec` and `SchemalessPartitionSpec` are maid available via a
common trait. While the trait is public and as such allows to develop functions
that take either variant as an input, I am exposing the most important
attributes (i.e. `fields`) still directly, so that the trait typically doesn't
need to be imported.

For reference, Java solves this by force-binding potentially non-compatible
schemas to a `PartitionSpec`
* `TableMetadataBuilder`:
https://github.com/apache/iceberg/blob/5a2c1c9c8dd8d61e08bbad713c24e6b726eee323/core/src/main/java/org/apache/iceberg/TableMetadata.java#L734
* `TableMetadataParser`:
https://github.com/apache/iceberg/blob/5a2c1c9c8dd8d61e08bbad713c24e6b726eee323/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L393

Happy to hear your opinions! This really is a bit of a fiddly topic. I don't
like to introduce a new type, on the other hand using the API feels more
natural to me now. This can also be seen in the tests, where we had a
significant number of places where we passed `PartitionSpec` alongside its
`Schema` around, just to call `partition_type` eventually.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[PR] feat: Safer PartitionSpec & SchemalessPartitionSpec [iceberg-rust]

Reply via email to