Fokko commented on code in PR #26:
URL: https://github.com/apache/iceberg-rust/pull/26#discussion_r1288043261


##########
crates/iceberg/src/spec/transform.rs:
##########
@@ -0,0 +1,766 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Transforms in iceberg.
+
+use crate::error::{Error, Result};
+use crate::spec::datatypes::{PrimitiveType, Type};
+use crate::ErrorKind;
+use serde::{Deserialize, Deserializer, Serialize, Serializer};
+use std::fmt::{Display, Formatter};
+use std::str::FromStr;
+
+/// Transform is used to transform predicates to partition predicates,
+/// in addition to transforming data values.
+///
+/// Deriving partition predicates from column predicates on the table data
+/// is used to separate the logical queries from physical storage: the
+/// partitioning can change and the correct partition filters are always
+/// derived from column predicates.
+///
+/// This simplifies queries because users don’t have to supply both logical
+/// predicates and partition predicates.
+///
+/// All transforms must return `null` for a `null` input value.
+#[derive(Debug, PartialEq, Eq, Clone, Copy)]
+pub enum Transform {
+    /// Source value, unmodified
+    ///
+    /// - Source type could be any type.
+    /// - Return type is the same as the source type.
+    Identity,
+    /// Hash of value, mod `N`.
+    ///
+    /// Bucket partition transforms use a 32-bit hash of the source value.
+    /// The 32-bit hash implementation is the 32-bit Murmur3 hash, x86
+    /// variant, seeded with 0.
+    ///
+    /// Transforms are parameterized by a number of buckets, N. The hash mod
+    /// N must produce a positive value by first discarding the sign bit of
+    /// the hash value. In pseudo-code, the function is:
+    ///
+    /// ```text
+    /// def bucket_N(x) = (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N
+    /// ```
+    ///
+    /// - Source type could be `int`, `long`, `decimal`, `date`, `time`,
+    ///   `timestamp`, `timestamptz`, `string`, `uuid`, `fixed`, `binary`.
+    /// - Return type is `int`.
+    Bucket(i32),
+    /// Value truncated to width `W`
+    ///
+    /// For `int`:
+    ///
+    /// - `v - (v % W)`
+    /// - example: W=10: 1 → 0, -1 → -10
+    /// - note: The remainder, v % W, must be positive.
+    ///
+    /// For `long`:
+    ///
+    /// - `v - (v % W)`
+    /// - example: W=10: 1 → 0, -1 → -10
+    /// - note: The remainder, v % W, must be positive.
+    ///
+    /// For `decimal`:
+    ///
+    /// - `scaled_W = decimal(W, scale(v))`, then `v - (v % scaled_W)`
+    /// - example: W=50, s=2: 10.65 → 10.50
+    ///
+    /// For `string`:
+    ///
+    /// - Substring of length L: `v.substring(0, L)`
+    /// - example: L=3: iceberg → ice
+    /// - note: Strings are truncated to a valid UTF-8 string with no more
+    ///   than L code points.
+    ///
+    /// - Source type could be `int`, `long`, `decimal`, `string`, `binary`
+    /// - Return type is the same as the source type.
+    Truncate(i32),
+    /// Extract a date or timestamp year, as years from 1970
+    ///
+    /// - Source type could be `date`, `timestamp`, `timestamptz`
+    /// - Return type is `int`
+    Year,
+    /// Extract a date or timestamp month, as months from 1970-01-01
+    ///
+    /// - Source type could be `date`, `timestamp`, `timestamptz`
+    /// - Return type is `int`
+    Month,
+    /// Extract a date or timestamp day, as days from 1970-01-01
+    ///
+    /// - Source type could be `date`, `timestamp`, `timestamptz`
+    /// - Return type is `int`
+    Day,
+    /// Extract a timestamp hour, as hours from 1970-01-01 00:00:00
+    ///
+    /// - Source type could be `timestamp`, `timestamptz`
+    /// - Return type is `int`
+    Hour,
+    /// Always produces `null`
+    ///
+    /// The void transform may be used to replace the transform in an
+    /// existing partition field so that the field is effectively dropped in
+    /// v1 tables.
+    ///
+    /// - Source type could be any type.
+    /// - Return type is the source type.
+    Void,
+}
+
+impl Transform {
+    /// Get the return type of transform given the input type.
+    /// Returns `None` if it can't be transformed.
+    pub fn result_type(&self, input_type: &Type) -> Option<Type> {
+        match self {
+            Transform::Identity => {
+                if matches!(input_type, Type::Primitive(_)) {
+                    Some(input_type.clone())
+                } else {
+                    None
+                }
+            }
+            Transform::Void => Some(input_type.clone()),
+            Transform::Bucket(_) => {
+                if let Type::Primitive(p) = input_type {
+                    match p {
+                        PrimitiveType::Int
+                        | PrimitiveType::Long
+                        | PrimitiveType::Decimal { .. }
+                        | PrimitiveType::Date
+                        | PrimitiveType::Time
+                        | PrimitiveType::Timestamp
+                        | PrimitiveType::Timestamptz
+                        | PrimitiveType::String
+                        | PrimitiveType::Uuid
+                        | PrimitiveType::Fixed(_)
+                        | PrimitiveType::Binary => Some(Type::Primitive(PrimitiveType::Int)),
+                        _ => None,
+                    }
+                } else {
+                    None
+                }
+            }
+            Transform::Truncate(_) => {
+                if let Type::Primitive(p) = input_type {
+                    match p {
+                        PrimitiveType::Int
+                        | PrimitiveType::Long
+                        | PrimitiveType::String
+                        | PrimitiveType::Binary
+                        | PrimitiveType::Decimal { .. } => Some(input_type.clone()),
+                        _ => None,
+                    }
+                } else {
+                    None
+                }
+            }
+            Transform::Year | Transform::Month | Transform::Day => {
+                if let Type::Primitive(p) = input_type {
+                    match p {
+                        PrimitiveType::Timestamp
+                        | PrimitiveType::Timestamptz
+                        | PrimitiveType::Date => Some(Type::Primitive(PrimitiveType::Int)),
+                        _ => None,
+                    }
+                } else {
+                    None
+                }
+            }
+            Transform::Hour => {
+                if let Type::Primitive(p) = input_type {
+                    match p {
+                        PrimitiveType::Timestamp | PrimitiveType::Timestamptz => {
+                            Some(Type::Primitive(PrimitiveType::Int))
+                        }
+                        _ => None,
+                    }
+                } else {
+                    None
+                }
+            }
+        }

Review Comment:
   Java and Python also have an `UnknownTransform`. It rarely comes up, but 
people can use a custom transform, or a transform that `iceberg-rust` does not 
yet support, such as a [multi-arg bucket 
transform](https://github.com/apache/iceberg/pull/8259). In that case we would 
fall back to an unknown transform and could not take any advantage of 
partitioning (we fall back to a full table scan 😭).
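
   A rough sketch of what such a fallback could look like, under several 
assumptions: the enum below is a cut-down stand-in for the one in this PR, the 
`Unknown(String)` variant and the `supports_pruning` helper are hypothetical 
names, and the `bucket[N]` string form is only used here for illustration. 
Note that carrying a `String` means the enum can no longer derive `Copy`.

```rust
use std::convert::Infallible;
use std::str::FromStr;

/// Simplified stand-in for the PR's `Transform` enum (not the real definition).
#[derive(Debug, PartialEq, Eq, Clone)]
enum Transform {
    Identity,
    Bucket(u32),
    Void,
    /// Hypothetical catch-all: keeps the original string so it can be
    /// written back unchanged.
    Unknown(String),
}

impl FromStr for Transform {
    // Parsing never fails in this sketch: anything unrecognized is `Unknown`.
    type Err = Infallible;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let transform = match s {
            "identity" => Transform::Identity,
            "void" => Transform::Void,
            other => {
                // Accept only the single-argument form `bucket[N]`.
                let buckets = other
                    .strip_prefix("bucket[")
                    .and_then(|rest| rest.strip_suffix(']'))
                    .and_then(|n| n.parse::<u32>().ok());
                match buckets {
                    Some(n) => Transform::Bucket(n),
                    None => Transform::Unknown(other.to_string()),
                }
            }
        };
        Ok(transform)
    }
}

impl Transform {
    /// Hypothetical helper: an unknown transform cannot be used to derive
    /// partition predicates, so a reader would have to scan everything.
    fn supports_pruning(&self) -> bool {
        !matches!(self, Transform::Unknown(_))
    }
}
```

   With this shape, a multi-arg string such as `bucket[4, 8]` parses to 
`Transform::Unknown("bucket[4, 8]")` instead of producing an error, and the 
caller can check `supports_pruning()` before planning a scan.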



##########
crates/iceberg/src/spec/transform.rs:
##########
@@ -0,0 +1,766 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Transforms in iceberg.
+
+use crate::error::{Error, Result};
+use crate::spec::datatypes::{PrimitiveType, Type};
+use crate::ErrorKind;
+use serde::{Deserialize, Deserializer, Serialize, Serializer};
+use std::fmt::{Display, Formatter};
+use std::str::FromStr;
+
+/// Transform is used to transform predicates to partition predicates,
+/// in addition to transforming data values.
+///
+/// Deriving partition predicates from column predicates on the table data
+/// is used to separate the logical queries from physical storage: the
+/// partitioning can change and the correct partition filters are always
+/// derived from column predicates.
+///
+/// This simplifies queries because users don’t have to supply both logical
+/// predicates and partition predicates.
+///
+/// All transforms must return `null` for a `null` input value.
+#[derive(Debug, PartialEq, Eq, Clone, Copy)]
+pub enum Transform {
+    /// Source value, unmodified
+    ///
+    /// - Source type could be any type.
+    /// - Return type is the same as the source type.
+    Identity,
+    /// Hash of value, mod `N`.
+    ///
+    /// Bucket partition transforms use a 32-bit hash of the source value.
+    /// The 32-bit hash implementation is the 32-bit Murmur3 hash, x86
+    /// variant, seeded with 0.
+    ///
+    /// Transforms are parameterized by a number of buckets, N. The hash mod
+    /// N must produce a positive value by first discarding the sign bit of
+    /// the hash value. In pseudo-code, the function is:
+    ///
+    /// ```text
+    /// def bucket_N(x) = (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N
+    /// ```
+    ///
+    /// - Source type could be `int`, `long`, `decimal`, `date`, `time`,
+    ///   `timestamp`, `timestamptz`, `string`, `uuid`, `fixed`, `binary`.
+    /// - Return type is `int`.
+    Bucket(i32),

Review Comment:
   This could also be an unsigned integer; the number of buckets should be 
positive.
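
   As a minimal sketch of the idea, assuming the 32-bit Murmur3 hash has 
already been computed elsewhere: `bucket_index` is a hypothetical helper, not 
part of this PR. It applies the `(hash & Integer.MAX_VALUE) % N` rule from the 
doc comment with an unsigned bucket count.

```rust
/// Hypothetical helper: maps an already-computed 32-bit hash to a bucket.
/// Returns `None` for zero buckets, since `u32` still admits zero.
fn bucket_index(hash: i32, num_buckets: u32) -> Option<u32> {
    if num_buckets == 0 {
        return None;
    }
    // Discard the sign bit so the value is non-negative before the modulo.
    let non_negative = (hash & i32::MAX) as u32;
    Some(non_negative % num_buckets)
}
```

   An unsigned type rules out negative counts at compile time but not zero; 
`std::num::NonZeroU32` would be one way to encode that as well.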



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
