Fokko commented on issue #694:
URL: https://github.com/apache/iceberg-rust/issues/694#issuecomment-2480605287

   Unbound (not bound to a schema) and schemaless mean the same thing, so I find 
having both terms confusing. 
   
   Regarding https://github.com/apache/iceberg-rust/pull/645#issue-2543573501 
there are some incorrect assumptions there:
   
   > If we agree on this, then we have a small Problem with TableMetadata: It 
contains historic PartitionSpecs that cannot be bound against the 
current_schema. As a result, we need a type that has the same properties as 
PartitionSpec but is not bound to a schema. I thought we could use 
UnboundPartitionSpec for this at first, but it serves a different purpose and 
would make the very important `field_id` Optional.
   
   I don't understand this statement. If I look at the `UnboundPartitionSpec`:
   
   ```rs
   #[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
   #[serde(rename_all = "kebab-case")]
   pub struct UnboundPartitionField {
       /// A source column id from the table’s schema
       pub source_id: i32,
       /// A partition field id that is used to identify a partition field and
       /// is unique within a partition spec.
       /// In v2 table metadata, it is unique across all partition specs.
       #[builder(default, setter(strip_option))]
       pub field_id: Option<i32>,
       /// A partition name.
       pub name: String,
       /// A transform that is applied to the source column to produce a
       /// partition value.
       pub transform: Transform,
   }
   
   /// Unbound partition spec can be built without a schema and later bound to
   /// a schema.
   #[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default)]
   #[serde(rename_all = "kebab-case")]
   pub struct UnboundPartitionSpec {
       /// Identifier for PartitionSpec
       pub(crate) spec_id: Option<i32>,
       /// Details of the partition spec
       pub(crate) fields: Vec<UnboundPartitionField>,
   }
   ```
   
   > A PartitionSpec is only valid for the schema is had been build against, 
and we should not imply otherwise.
   
   This is not entirely correct. For example, when a field's type is promoted 
in a valid way, it is still valid to bind the spec against the same source-id.
   
   So, going from the problem space to the solution space: I was looking at the 
Java code and the PyIceberg code, and in PyIceberg we took a slightly different 
approach that might be interesting for iceberg-rust as well. As mentioned 
earlier, the type of the source column might evolve over time (assume 
compatible promotions such as `int -> long`, `float -> double`, etc.). The most 
obvious example is the identity partition, where the `sourceType` is equal to 
the `resultType`. Therefore we allow binding with different schemas in 
PyIceberg: 
https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/partitioning.py#L203-L225
 I think that might be a better approach for Rust as well.
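   
   To make the idea concrete, here is a minimal sketch of binding the same 
unbound field against two schema versions. It is not the iceberg-rust or 
PyIceberg API: `PrimitiveType`, `Schema`, `UnboundField`, `BoundField` and 
`bind` are hypothetical, pared-down stand-ins, and only the identity transform 
is modelled. The point is that the result type is derived from whichever schema 
you bind against, so a promoted source column simply yields a promoted 
partition type.
   
   ```rust
   use std::collections::HashMap;

   /// Hypothetical stand-in for the Iceberg primitive types.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   pub enum PrimitiveType {
       Int,
       Long,
       String,
   }

   /// Only the identity transform is modelled in this sketch.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   pub enum Transform {
       Identity,
   }

   impl Transform {
       /// For identity, the result type equals the source type.
       pub fn result_type(&self, source: PrimitiveType) -> PrimitiveType {
           match self {
               Transform::Identity => source,
           }
       }
   }

   /// Hypothetical schema: a map from field id to field type.
   pub struct Schema {
       fields: HashMap<i32, PrimitiveType>,
   }

   impl Schema {
       pub fn new(fields: &[(i32, PrimitiveType)]) -> Self {
           Schema { fields: fields.iter().copied().collect() }
       }

       pub fn field_type(&self, id: i32) -> Option<PrimitiveType> {
           self.fields.get(&id).copied()
       }
   }

   /// Partition field that carries no type information yet.
   pub struct UnboundField {
       pub source_id: i32,
       pub name: String,
       pub transform: Transform,
   }

   /// Partition field whose result type was resolved against a schema.
   #[derive(Debug, PartialEq, Eq)]
   pub struct BoundField {
       pub source_id: i32,
       pub name: String,
       pub result_type: PrimitiveType,
   }

   /// Bind against any schema that still contains the source ids.
   pub fn bind(fields: &[UnboundField], schema: &Schema) -> Result<Vec<BoundField>, String> {
       fields
           .iter()
           .map(|f| -> Result<BoundField, String> {
               let source_type = schema
                   .field_type(f.source_id)
                   .ok_or_else(|| format!("source id {} not found in schema", f.source_id))?;
               Ok(BoundField {
                   source_id: f.source_id,
                   name: f.name.clone(),
                   result_type: f.transform.result_type(source_type),
               })
           })
           .collect()
   }
   ```
   
   Binding an identity field on source id 1 against a schema where that column 
is `Int` gives an `Int` partition field; binding it against a later schema 
where the column was promoted to `Long` gives a `Long` one, with no change to 
the unbound spec itself.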
   
   When the field is not in the current spec anymore, it is also unlikely that 
you would filter on that field, so that field from the partition spec could 
also be ignored. The downside is that in strongly typed languages we need to 
know the type upfront; otherwise we only know it when we read the actual Avro 
file (which carries the schema with the type). Therefore I'm playing around 
with a solution for this on the Java side: 
https://github.com/apache/iceberg/pull/11542
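   
   As a rough illustration of the "ignore it" option, the sketch below binds 
leniently: partition fields whose source column cannot be found in the schema 
are set aside instead of failing the whole bind. This is an assumption about 
what ignoring could look like, not what the linked PR does; `UnboundField`, 
`BoundField` and `bind_lenient` are hypothetical names, and the schema is 
reduced to a map from field id to a type name.
   
   ```rust
   use std::collections::HashMap;

   /// Hypothetical partition field without type information.
   #[derive(Debug, Clone, PartialEq, Eq)]
   pub struct UnboundField {
       pub source_id: i32,
       pub field_id: i32,
       pub name: String,
   }

   /// Hypothetical partition field with the source type resolved.
   #[derive(Debug, Clone, PartialEq, Eq)]
   pub struct BoundField {
       pub field_id: i32,
       pub name: String,
       pub source_type: String,
   }

   /// Bind what can be bound; return the fields whose source column is
   /// missing from the schema separately so the caller can ignore them.
   pub fn bind_lenient(
       fields: &[UnboundField],
       schema: &HashMap<i32, String>,
   ) -> (Vec<BoundField>, Vec<UnboundField>) {
       let mut bound = Vec::new();
       let mut skipped = Vec::new();
       for f in fields {
           match schema.get(&f.source_id) {
               Some(ty) => bound.push(BoundField {
                   field_id: f.field_id,
                   name: f.name.clone(),
                   source_type: ty.clone(),
               }),
               // The type is unknown here; it would only be known once
               // the Avro file carrying the schema is read.
               None => skipped.push(f.clone()),
           }
       }
       (bound, skipped)
   }
   ```
   
   The skipped fields keep their `field_id` and `name`, so they can still be 
reported or resolved later once the type becomes known.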
   
   I hope this helps!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

