Re: [I] Why shouldn't we return an `UnboundPartitionSpec` instead? [iceberg-rust]

via GitHub Sat, 16 Nov 2024 15:48:02 -0800


c-thiel commented on issue #694:
URL: https://github.com/apache/iceberg-rust/issues/694#issuecomment-2480849988

> Unbound (not bound to a schema) and schemaless are the same, so I find
them confusing.

They are not quite - a previously bound schema has a required `field_id`
(reference to `PartitionField`), while unbound has an optional `field_id`
(`Option<i32>`) as it references `UnboundPartitionField`.
`TableMetadata.partition_specs` requires `field_id`, so we shouldn't use
`UnboundPartitionSpec` there.

>> A PartitionSpec is only valid for the schema is had been build against,
and we should not imply otherwise.

> This is not entirely correct. For example, when a field is being promoted
in a valid way, then it is still valid to bind against the same source-id.

True, that was too strict. What I should have expressed it that it is not
guaranteed to be compatible with the current schema, which is why Java uses
`bindUnchecked()`.
I think with the current implementation we are quite close to what you
propose - binding a `SchemalessPartitionSpec` or `UnboundPartitionSpec` to a
new schema is easily possible by using `spec.to_unbound().bind(new_schema)`.
Currently this binding can fail. What is missing is the notion that you
described with

> When the field is not in the current spec anymore, it is also unlikely
that you would filter on that field, therefore that field from the
partition-spec could also be ignored

This would require a more permissive `bind` method, or an additional flag,
that would ignore unknown fields, similar to what you try to incorporate into
Java with `Any`. So re-binding is already possible today.

The `BoundPartitionSpec` serves two additional purposes - it is a way to
compute the binding and store the result typesafe. This is useful for
1. Pre-computing the bind, as we currently do for `default_spec` against the
`current_schema` - which is the most important combination.
2. Have the means to get the source field names for a partition specs. This
is useful when a spec is transferred from one context to another. In Java this
is used for `addPartitionSpec` which eventually runs this code:
https://github.com/apache/iceberg/blob/acd7cc1126b192ccb53ad8198bda37e983aa4c6c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L768.
Note that currently we don't have this semantic in the builder, which in java
performs name-mapping to potentially change the `source_id`. Instead we pass
an `UnboundSpec` which trusts the provides `source_id` and not the field name.

Summing it up, my approach would be to keep all three types - I simply don't
see a good way around them.
However, we should add a more permissive bind method that incorporates
better the semantic of binding to an incompatible schema. This would be an
alternative to the `bindUnchecked()` in Java. Maybe we can just follow the PR
you are working on in Java once that's ready. This method would not necessarily
be able to determine a type for each field in a `spec`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Why shouldn't we return an `UnboundPartitionSpec` instead? [iceberg-rust]

Reply via email to