Re: [I] Why shouldn't we return an `UnboundPartitionSpec` instead? [iceberg-rust]

via GitHub Sat, 16 Nov 2024 07:01:04 -0800


Fokko commented on issue #694:
URL: https://github.com/apache/iceberg-rust/issues/694#issuecomment-2480605287

Unbound (not bound to a schema) and schemaless mean the same thing, so I find
having both terms confusing.

Regarding https://github.com/apache/iceberg-rust/pull/645#issue-2543573501
there are some incorrect assumptions there:

> If we agree on this, then we have a small Problem with TableMetadata: It
> contains historic PartitionSpecs that cannot be bound against the
> current_schema. As a result, we need a type that has the same properties as
> PartitionSpec but is not bound to a schema. I thought we could use
> UnboundPartitionSpec for this at first, but it serves a different purpose and
> would make the very important `field_id` Optional.

I don't understand this statement. If I look at the `UnboundPartitionSpec`:

```rs
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionField {
    /// A source column id from the table’s schema
    pub source_id: i32,
    /// A partition field id that is used to identify a partition field and
    /// is unique within a partition spec.
    /// In v2 table metadata, it is unique across all partition specs.
    #[builder(default, setter(strip_option))]
    pub field_id: Option<i32>,
    /// A partition name.
    pub name: String,
    /// A transform that is applied to the source column to produce a
    /// partition value.
    pub transform: Transform,
}

/// Unbound partition spec can be built without a schema and later bound to
/// a schema.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}
```

> A PartitionSpec is only valid for the schema it had been built against,
> and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in
a valid way, then it is still valid to bind against the same source-id.

So, going from the problem space to the solution space: I was looking at the
Java code and the PyIceberg code, and in PyIceberg we took a slightly different
approach that might be interesting for iceberg-rust as well. As mentioned
earlier, the field behind the source ID might evolve over time (assuming
compatible promotions such as `int -> long`, `float -> double`, etc.). The most
obvious example is the identity partition, where the `sourceType` is equal to
the `resultType`. Therefore we allow binding with different schemas in
PyIceberg:
https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/partitioning.py#L203-L225
I think that might be a better approach for Rust as well.
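
To make the idea concrete, here is a minimal sketch of what binding against
different schemas could look like for an identity partition. All of the types
and functions below (`PrimitiveType`, `Schema`, `IdentityField`,
`result_type`) are hypothetical and simplified for illustration; they are not
the iceberg-rust or PyIceberg API.

```rust
/// Hypothetical, simplified set of primitive types (not the iceberg-rust `Type`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimitiveType {
    Int,
    Long,
    Float,
    Double,
}

/// Hypothetical schema: a list of (field id, type) pairs.
pub struct Schema {
    pub fields: Vec<(i32, PrimitiveType)>,
}

impl Schema {
    /// Look up the type of a column by its field id.
    pub fn type_of(&self, source_id: i32) -> Option<PrimitiveType> {
        self.fields
            .iter()
            .find(|(id, _)| *id == source_id)
            .map(|(_, ty)| *ty)
    }
}

/// Hypothetical identity partition field that only stores the source id,
/// not the source type.
pub struct IdentityField {
    pub source_id: i32,
    pub field_id: i32,
    pub name: String,
}

/// For an identity transform the result type equals the source type, so it
/// is resolved against whichever schema the caller binds to. Returns `None`
/// when the schema has no column with that source id.
pub fn result_type(field: &IdentityField, schema: &Schema) -> Option<PrimitiveType> {
    schema.type_of(field.source_id)
}
```

With this shape, the same field resolves to `Int` against a schema from before
an `int -> long` promotion and to `Long` against the schema after it, without
the spec itself having to change.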

When the field is not in the current spec anymore, it is also unlikely that
you would filter on that field, therefore that field from the partition-spec
could also be ignored. The downside is that with strongly typed languages, we
need to know the type upfront, and otherwise we only know it when we read the
actual Avro file (that has the schema with the type). Therefore I'm playing
around with a solution for this on the Java side:
https://github.com/apache/iceberg/pull/11542
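
As a rough sketch of the "ignore the field" option, assuming it means skipping
partition fields whose source column can no longer be resolved in the schema
we bind against (my reading, not something the linked PR is confirmed to do).
`PartitionField` and `bindable_fields` are hypothetical names, not an existing
API:

```rust
use std::collections::HashSet;

/// Hypothetical, simplified partition field.
pub struct PartitionField {
    pub source_id: i32,
    pub name: String,
}

/// Keep only the partition fields whose source column id is present in the
/// given set of schema field ids; all other fields are skipped.
pub fn bindable_fields<'a>(
    fields: &'a [PartitionField],
    schema_ids: &HashSet<i32>,
) -> Vec<&'a PartitionField> {
    fields
        .iter()
        .filter(|f| schema_ids.contains(&f.source_id))
        .collect()
}
```

Note that this only decides which fields can be bound; it does not solve the
typing problem described above for the fields that are skipped.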

I hope this helps!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

