Fokko commented on issue #694: URL: https://github.com/apache/iceberg-rust/issues/694#issuecomment-2480605287
Unbound (not bound to a schema) and schemaless are the same, so I find them confusing. Regarding https://github.com/apache/iceberg-rust/pull/645#issue-2543573501 there are some incorrect assumptions there:

> If we agree on this, then we have a small Problem with TableMetadata: It contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important `field_id` Optional.

I don't understand this statement. If I look at the `UnboundPartitionSpec`:

```rs
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionField {
    /// A source column id from the table’s schema
    pub source_id: i32,
    /// A partition field id that is used to identify a partition field and is unique within a partition spec.
    /// In v2 table metadata, it is unique across all partition specs.
    #[builder(default, setter(strip_option))]
    pub field_id: Option<i32>,
    /// A partition name.
    pub name: String,
    /// A transform that is applied to the source column to produce a partition value.
    pub transform: Transform,
}

/// Unbound partition spec can be built without a schema and later bound to a schema.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}
```

> A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.
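To make the promotion point concrete, here is a minimal sketch of binding an identity partition field by source-id against two versions of a schema. Everything in it (`PrimitiveType`, `Schema`, `UnboundField`, `BoundField`, `bind_identity`) is a simplified, hypothetical stand-in written for illustration, not the actual iceberg-rust API:

```rust
use std::collections::HashMap;

/// Simplified stand-in for Iceberg primitive types (assumption, not the real type system).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimitiveType {
    Int,
    Long,
    Float,
    Double,
}

/// Hypothetical schema: maps a field id to the field's current type.
#[derive(Debug, Default)]
pub struct Schema {
    pub fields: HashMap<i32, PrimitiveType>,
}

impl Schema {
    /// Adds (or replaces) a field, returning the schema for chaining.
    pub fn with_field(mut self, id: i32, ty: PrimitiveType) -> Self {
        self.fields.insert(id, ty);
        self
    }
}

/// Hypothetical unbound identity partition field: it only references a source id.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UnboundField {
    pub source_id: i32,
    pub field_id: i32,
    pub name: String,
}

/// Hypothetical bound field: the result type is resolved from the schema.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BoundField {
    pub field_id: i32,
    pub name: String,
    pub result_type: PrimitiveType,
}

/// Binds an identity partition field by looking up its source id in the schema.
/// For the identity transform the result type equals the source type, so the
/// same field binds to `Int` before a promotion and to `Long` after it.
/// Returns `None` when the source column is not present in the schema.
pub fn bind_identity(field: &UnboundField, schema: &Schema) -> Option<BoundField> {
    let source_type = schema.fields.get(&field.source_id)?;
    Some(BoundField {
        field_id: field.field_id,
        name: field.name.clone(),
        result_type: *source_type,
    })
}
```

Binding the same `UnboundField` against a schema where field 1 is `Int` and against one where it was promoted to `Long` succeeds in both cases; only the resolved `result_type` differs.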
So, going from the problem space to the solution space: I was looking at the Java code and the PyIceberg code, and in PyIceberg we took a slightly different approach that might be interesting for Iceberg-rust as well. As mentioned earlier, the type of the source field might evolve over time (assume compatible promotions such as `int -> long`, `float -> double`, etc.). The most obvious example is the identity partition, where the `sourceType` is equal to the `resultType`. Therefore we allow binding with different schemas in PyIceberg: https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/partitioning.py#L203-L225

I think that might be a better approach for Rust as well. When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, so that field from the partition spec could also be ignored. The downside is that in strongly typed languages we need to know the type up front; otherwise we only know it when we read the actual Avro file (which has the schema with the type). Therefore I'm playing around with a solution for this on the Java side: https://github.com/apache/iceberg/pull/11542

I hope this helps!