singhpk234 commented on code in PR #14117:
URL: https://github.com/apache/iceberg/pull/14117#discussion_r2484042930


##########
format/udf-spec.md:
##########
@@ -0,0 +1,299 @@
+---
+title: "SQL UDF Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg UDF Spec
+
+## Background and Motivation
+
+A SQL user-defined function (UDF or UDTF) is a callable routine that accepts input parameters and executes a function body.
+Depending on the function type, the result can be:
+
+- **Scalar function (UDF)** – returns a scalar value, which may be a primitive type (e.g., `int`, `string`) or a non-primitive type (e.g., `struct`, `list`).
+- **Table function (UDTF)** – returns a table with zero or more rows of columns with a uniform schema.
+
+Many compute engines (e.g., Spark, Trino) already support UDFs, but in different and incompatible ways. Without a common
+standard, UDFs cannot be reliably shared across engines or reused in multi-engine environments.
+
+This specification introduces a standardized metadata format for UDFs in Iceberg.
+
+## Goals
+
+* Define a portable metadata format for both scalar and table SQL UDFs. The metadata is self-contained and can be moved across catalogs.
+* Support function evolution through versioning and rollback.
+* Provide consistent semantics for representing UDFs across engines.
+
+## Overview
+
+UDF metadata follows the same design principles as Iceberg table and view metadata: each function is represented by a
+**self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security,
+properties, and engine-specific representations.
+
+* Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata.
+* Each metadata file includes recent definition versions, enabling rollbacks without external state.
+
+## Specification
+
+### UDF Metadata
+The UDF metadata file has the following fields:
+
+| Requirement | Field name       | Type                   | Description                                                      |
+|-------------|------------------|------------------------|------------------------------------------------------------------|
+| *required*  | `function-uuid`  | `string`               | A UUID that identifies the function, generated once at creation. |
+| *required*  | `format-version` | `int`                  | Metadata format version (must be `1`).                           |
+| *required*  | `definitions`    | `list<definition>`     | List of function [definition](#definition) entities.             |
+| *required*  | `definition-log` | `list<definition-log>` | History of [definitions snapshots](#definitions-log).            |
+| *optional*  | `location`       | `string`               | Storage location of metadata files.                              |
+| *optional*  | `properties`     | `map`                  | A string to string map of properties.                            |
+| *optional*  | `secure`         | `boolean`              | Whether it is a secure function. Default: `false`.               |
+| *optional*  | `doc`            | `string`               | Documentation string.                                            |
+
+Notes:
+1. When `secure` is `true`:
+   - Engines **SHOULD NOT** expose the function definition through inspection (e.g., `SHOW FUNCTIONS`).
+   - Engines **SHOULD** ensure that execution does not leak sensitive information through error messages, logs, or query plans.
+
+### Definition
+
+Each `definition` represents one function signature (e.g., `add_one(int)` vs `add_one(float)`).
+
+| Requirement | Field name           | Type | Description |
+|-------------|----------------------|------|-------------|
+| *required*  | `definition-id`      | `string` | An identifier derived from canonical parameter-type tuple (lowercase, no spaces; e.g., `"(int,int,string)"`). If longer than 128 chars, use hashed form `"sig1-<base32(SHA-256(signature))[:26]>"`. |
+| *required*  | `parameters`         | `list<parameter>` | Ordered list of [function parameters](#parameter). Invocation order **must** match this list. |
+| *required*  | `return-type`        | [JSON representation](https://iceberg.apache.org/spec/#appendix-c-json-serialization) of an Iceberg type (`string` for primitives, `object` for complex types) | Type of value returned |
+| *optional*  | `nullable-return`    | `boolean` | Whether the return value is nullable or not. Default: `true`. |
+| *required*  | `versions`           | `list<definition-version>` | [Versioned implementations](#definition-version) of this definition. |
+| *required*  | `current-version-id` | `int` | Identifier of the current version for this definition. |
+| *optional*  | `function-type`      | `string` (`"udf"` or `"udtf"`, default `"udf"`) | If `"udtf"`, `return-type` must be a `struct` describing the output schema. |
+| *optional*  | `doc`                | `string` | Documentation string. |
+
+Notes:
+1. `sig1-<base32(SHA-256(signature))[:26]>` is a fixed-length hash used when the canonical signature is too long.
+It’s generated by taking the SHA-256 of the normalized signature, encoding it in Base32, keeping the first 26 characters,
+and prefixing with `sig1-`. This yields a 31-character deterministic ID, easy to verify and future-proof via the `sig1-`
+version prefix.
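
Editor's sketch of the `definition-id` derivation described in the note above. It is only an illustration under stated assumptions: the helper name `definition_id` is hypothetical, the signature is assumed to be hashed as UTF-8 bytes, and the Base32 output uses Python's default RFC 4648 uppercase alphabet, because the quoted spec text does not say which alphabet or case to use. Only primitive type names passed as plain strings are handled.

```python
import base64
import hashlib
from typing import List


def definition_id(param_types: List[str]) -> str:
    """Build a definition-id from primitive type names (hypothetical helper)."""
    # Canonical tuple: lowercase, no spaces, e.g. "(int,int,string)".
    signature = "(" + ",".join(t.lower().replace(" ", "") for t in param_types) + ")"
    if len(signature) <= 128:
        return signature
    # Assumption: hash the UTF-8 bytes of the canonical signature.
    digest = hashlib.sha256(signature.encode("utf-8")).digest()
    # Assumption: standard RFC 4648 Base32 (uppercase); keep the first 26 characters.
    return "sig1-" + base64.b32encode(digest).decode("ascii")[:26]


print(definition_id(["int", "int", "string"]))
print(definition_id(["string"] * 30))
```

The second call exceeds 128 characters, so it takes the hashed branch and yields a 31-character ID (5 for the prefix plus 26 for the truncated hash).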
+
+### Parameter
+| Requirement | Field  | Type     | Description     |
+|-------------|--------|----------|-----------------|
+| *required*  | `name` | `string` | Parameter name. |

Review Comment:
   Should parameter names always be treated as case-sensitive, or always as case-insensitive? We might need this information when binding to get the field ID.
   
   A quick search shows that PostgreSQL supports case-sensitive parameter names:
   
   ```
   CREATE FUNCTION get_double(input_value int, "MyArg" int)
   RETURNS int AS $$
   BEGIN
       -- You MUST use quotes to access it
       RETURN input_value + "MyArg";
   END;
   $$ LANGUAGE plpgsql;
   ```
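
To make the binding question concrete, here is a minimal Python sketch of how the same named-argument call binds differently under the two rules. The helper `bind_named_args` and its error messages are hypothetical and are not taken from the spec or from any engine.

```python
from typing import Dict, List


def bind_named_args(param_names: List[str], call_args: Dict[str, object],
                    case_sensitive: bool) -> List[object]:
    """Return argument values in declared parameter order (hypothetical helper)."""
    def norm(name: str) -> str:
        return name if case_sensitive else name.lower()

    # Map each (normalized) declared name to its position.
    declared: Dict[str, int] = {}
    for position, name in enumerate(param_names):
        key = norm(name)
        if key in declared:
            raise ValueError(f"ambiguous parameter name: {name!r}")
        declared[key] = position

    bound: List[object] = [None] * len(param_names)
    seen = set()
    for name, value in call_args.items():
        key = norm(name)
        if key not in declared:
            raise ValueError(f"unknown parameter: {name!r}")
        if key in seen:
            raise ValueError(f"duplicate argument: {name!r}")
        seen.add(key)
        bound[declared[key]] = value
    if len(seen) != len(param_names):
        raise ValueError("missing arguments")
    return bound


params = ["input_value", "MyArg"]
call = {"myarg": 2, "input_value": 1}
print(bind_named_args(params, call, case_sensitive=False))
```

With `case_sensitive=True` the same call raises `ValueError` because `myarg` does not match `MyArg`, and a definition declaring both `a` and `A` is only bindable under the case-sensitive rule.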



##########
format/udf-spec.md:
##########
@@ -0,0 +1,299 @@
+### Parameter
+| Requirement | Field  | Type | Description |
+|-------------|--------|------|-------------|
+| *required*  | `name` | `string` | Parameter name. |
+| *required*  | `type` | [JSON representation](https://iceberg.apache.org/spec/#appendix-c-json-serialization) of an Iceberg type (`string` for primitives, `object` for complex types) | Parameter data type. |
+| *optional*  | `doc`  | `string` | Parameter documentation. |
+
+Notes:
+1. Function definitions are identified by the tuple of `type`s and there can be only one definition for a given tuple.
+2. Parameter `name`s are immutable since named argument invocation is supported (e.g., `foo(a => 1, b => 2)`). Only `doc` can be updated in place.
+3. Variadic (vararg) parameters are not supported. Each definition must declare a fixed number of parameters.
+4. Each parameter input MUST be assignable to its declared Iceberg type. For complex types, the value’s
+   structure must match (correct field names, element/key/value types, and 
nesting). If a parameter—or any nested
+   field/element—is marked required, engines MUST reject null at that position 
(including inside structs, lists, and maps).
+5. The `return-type` is immutable. To change it, users must create a new definition and remove the old one.
+6. The function MUST return a value assignable to the declared `return-type`, meaning the returned value’s type and
+   structure must match the declared Iceberg type (including field names, element types, and nesting for complex types),
+   and any field or element marked as required MUST NOT be null. Engines MUST reject results that violate these rules.
+
+### Definition-Version
+
+Each definition can evolve over time by introducing new versions.  
+A `definition version` represents a specific implementation of that definition at a given point in time.
+
+| Requirement | Field name        | Type | Description |
+|-------------|-------------------|------|-------------|
+| *required*  | `version-id`      | `int` | Monotonically increasing identifier of the definition version. |
+| *required*  | `representations` | `list<representation>` | [Dialect-specific implementations](#representation). |
+| *optional*  | `deterministic`   | `boolean` (default `false`) | Whether the function is deterministic. |
+| *optional*  | `on-null-input`   | `string` (`"returns_null_on_null_input"` or `"called_on_null_input"`, default `"called_on_null_input"`) | Defines how the UDF behaves when any input parameter is NULL. |
+| *required*  | `timestamp-ms`    | `long` (epoch millis) | Creation timestamp of this version. |
+
+Note:
+
+`on-null-input` provides an optimization hint for query engines, its value can be either `"returns_null_on_null_input"` or `"called_on_null_input"`:
+1. If set to `returns_null_on_null_input`, the function always returns `NULL` if any input argument is `NULL`. This allows engines to apply predicate pushdown or skip function evaluation for rows with `NULL` inputs. For a function `f(x, y) = x + y`,
+the engine can safely rewrite `WHERE f(a,b) > 0` as `WHERE a IS NOT NULL AND b IS NOT NULL AND f(a,b) > 0`.
+2. If set to `called_on_null_input`, the function may handle `NULL`s internally (e.g., `COALESCE`, `NVL`, `IFNULL`), so the engine must execute the function even if some inputs are `NULL`.
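
Editor's sketch of the rewrite described in the note above, simulated in Python rather than SQL. The sample rows and the helper `f` are made up for illustration. `None` stands in for SQL `NULL`, and a `NULL` comparison result is treated as not true, as a `WHERE` clause would.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], Optional[int]]


def f(x: Optional[int], y: Optional[int]) -> Optional[int]:
    """Stand-in for f(x, y) = x + y with returns_null_on_null_input semantics."""
    if x is None or y is None:
        return None
    return x + y


rows: List[Row] = [(1, 2), (None, 5), (-4, 1), (3, None)]

# WHERE f(a,b) > 0: a NULL result never satisfies the predicate.
original = [r for r in rows if f(*r) is not None and f(*r) > 0]

# WHERE a IS NOT NULL AND b IS NOT NULL AND f(a,b) > 0: f is skipped for NULL inputs.
rewritten = [r for r in rows
             if r[0] is not None and r[1] is not None and f(*r) > 0]

print(original == rewritten)
```

Both filters keep only `(1, 2)`, but the rewritten form never calls `f` on a row with a `NULL` input.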
+
+### Representation
+A representation encodes how the definition version is expressed in a specific SQL dialect.
+
+| Requirement | Field name   | Type              | Description                                                                                 |
+|-------------|--------------|-------------------|---------------------------------------------------------------------------------------------|
+| *required*  | `type`       | `string`          | Must be `"sql"`                                                                             |
+| *required*  | `dialect`    | `string`          | SQL dialect identifier (e.g., `"spark"`, `"trino"`).                                        |
+| *optional*  | `parameters` | `list<parameter>` | Ordered list of [function parameters](#parameter). Overrides canonical names in definition. |
+| *required*  | `body`       | `string`          | SQL expression text.                                                                        |
+
+Note: The `body` must be valid SQL in the specified dialect; validation is the responsibility of the consuming engine.
+
+### Definition log
+| Requirement | Field name            | Type                                               | Description                                                      |
+|-------------|-----------------------|----------------------------------------------------|------------------------------------------------------------------|
+| *required*  | `timestamp-ms`        | `long` (epoch millis)                              | When the definition snapshot was created or updated.             |
+| *required*  | `definition-versions` | `list<{ definition-id: string, version-id: int }>` | Mapping of each definition to its selected version at this time. |
+
+## Function Resolution in Engines
+Resolution rule is decided by engines, but engines SHOULD:

Review Comment:
   [doubt] If engines decide the resolution rule, wouldn't the output of a UDF become engine-dependent? For example, if one engine prefers `double` over `float` and both overloads exist, the engines would produce different output.
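
A minimal Python sketch of this concern, assuming two hypothetical engines that differ only in their implicit-cast preference for an `int` argument. The overload set, the cast tables, and the `resolve` helper are illustrative and do not describe any real engine.

```python
from itertools import product
from typing import Dict, List, Optional

# Hypothetical overloads of one function, identified by definition-id.
OVERLOADS = {"(float)", "(double)"}

# Hypothetical implicit-cast preferences, in priority order.
ENGINE_A_CASTS: Dict[str, List[str]] = {"int": ["double", "float"]}
ENGINE_B_CASTS: Dict[str, List[str]] = {"int": ["float", "double"]}


def resolve(arg_types: List[str], casts: Dict[str, List[str]]) -> Optional[str]:
    """Pick a definition-id: exact match first, then the engine's preferred casts."""
    # Each argument may keep its own type or be cast, in the engine's order.
    candidates = [[t] + casts.get(t, []) for t in arg_types]
    for combo in product(*candidates):
        definition_id = "(" + ",".join(combo) + ")"
        if definition_id in OVERLOADS:
            return definition_id
    return None


print(resolve(["int"], ENGINE_A_CASTS))
print(resolve(["int"], ENGINE_B_CASTS))
```

The same call with an `int` argument resolves to `(double)` on the first engine and `(float)` on the second, so the two engines would run different bodies for the same query.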



##########
format/udf-spec.md:
##########
@@ -0,0 +1,299 @@
+Notes:
+1. When `secure` is `true`:
+   - Engines **SHOULD NOT** expose the function definition through inspection (e.g., `SHOW FUNCTIONS`).
+   - Engines **SHOULD** ensure that execution does not leak sensitive information through error messages, logs, or query plans.

Review Comment:
   [doubt]
   1/ What should engines do if they cannot enforce these requirements? Fail? Otherwise one can just use an engine that doesn't enforce them and get the UDF definition.
   2/ Do we want to say something about predicate-reordering attacks, implying that the UDF should be executed completely (without participating in the optimizer's predicate reordering) and in isolation? Example attack: https://docs.snowflake.com/en/user-guide/views-secure#how-might-data-be-exposed-by-a-non-secure-view
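
A rough Python analogue of the reordering concern in 2/, loosely modeled on the secure-view example linked above. The rows, the access filter, and the side channel are all made up, and nothing here claims to describe how any engine actually plans queries. It only shows that evaluating a user-supplied function before the access filter lets that function observe rows the caller should never see.

```python
from typing import Dict, List

Row = Dict[str, object]

ROWS: List[Row] = [
    {"owner": "alice", "salary": 100},
    {"owner": "bob", "salary": 250},
]


def access_filter(row: Row) -> bool:
    """Restricts the caller (hypothetically alice) to her own rows."""
    return row["owner"] == "alice"


def observed_by_udf(udf_first: bool) -> List[object]:
    """Return the salaries a user-supplied function gets to see."""
    observed: List[object] = []

    def user_udf(row: Row) -> bool:
        observed.append(row["salary"])  # side channel, e.g. an error message
        return True

    predicates = [user_udf, access_filter] if udf_first else [access_filter, user_udf]
    # all() short-circuits, so later predicates run only if earlier ones pass.
    visible = [r for r in ROWS if all(p(r) for p in predicates)]
    assert visible == [ROWS[0]]  # the query result is the same either way
    return observed


print(observed_by_udf(udf_first=False))
print(observed_by_udf(udf_first=True))
```

The visible result is identical in both orders, but in the reordered plan the function also observes bob's salary.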
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

