This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 00928f458c09 [SPARK-53716][PYTHON][DOCS] Document vectorized UDF with `@udf`
00928f458c09 is described below
commit 00928f458c09e3446e215128bd59977aca00b932
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Thu Sep 25 16:34:25 2025 -0700
[SPARK-53716][PYTHON][DOCS] Document vectorized UDF with `@udf`
### What changes were proposed in this pull request?
Document vectorized UDF with `udf`
### Why are the changes needed?
to document this new feature
### Does this PR introduce _any_ user-facing change?
yes, doc-only change
### How was this patch tested?
ci, doctest
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #52454 from zhengruifeng/doc_udf_vec.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
python/pyspark/sql/functions/builtin.py | 59 +++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py
index 68753ba3351d..1df8ddd1f432 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -27515,6 +27515,10 @@ def udf(
----------
f : function, optional
python function if used as a standalone function
+
+ .. versionchanged:: 4.1.0
+ Supports vectorized functions by specifying type hints.
+
returnType : :class:`pyspark.sql.types.DataType` or str, optional
the return type of the user-defined function. The value can be either a
:class:`pyspark.sql.types.DataType` object or a DDL-formatted type
string.
@@ -27559,6 +27563,61 @@ def udf(
| 101|
+-----------------------------+
+ Support for vectorized functions by specifying type hints.
+
+ To define a vectorized function, the function should meet the following requirements:
+
+ 1. have at least one argument (0-argument functions are not supported);
+
+ 2. the type hints should match one of the supported pandas UDF or Arrow UDF patterns;
+
+ 3. the argument `useArrow` should not be explicitly set.
+
+ If a function does not meet these requirements, it will be treated as a vanilla Python UDF or an Arrow-optimized Python UDF (depending on the argument `useArrow`, the configuration `spark.sql.execution.pythonUDF.arrow.enabled`, and the installed dependencies).
+
+ For example, define a 'Series to Series' pandas UDF.
+
+ >>> from pyspark.sql.functions import udf, PandasUDFType
+ >>> import pandas as pd
+ >>> @udf(returnType=IntegerType())
+ ... def pd_calc(a: pd.Series, b: pd.Series) -> pd.Series:
+ ... return a + 10 * b
+ ...
+ >>> pd_calc.evalType == PandasUDFType.SCALAR
+ True
+ >>> spark.range(2).select(pd_calc(b=col("id") * 10, a="id")).show()
+ +--------------------------------+
+ |pd_calc(b => (id * 10), a => id)|
+ +--------------------------------+
+ | 0|
+ | 101|
+ +--------------------------------+
+
+ As another example, define an 'Array to Array' Arrow UDF.
+
+ >>> from pyspark.sql.functions import udf, ArrowUDFType
+ >>> import pyarrow as pa
+ >>> @udf(returnType=IntegerType())
+ ... def pa_calc(a: pa.Array, b: pa.Array) -> pa.Array:
+ ... return pa.compute.add(a, pa.compute.multiply(b, 10))
+ ...
+ >>> pa_calc.evalType == ArrowUDFType.SCALAR
+ True
+ >>> spark.range(2).select(pa_calc(b=col("id") * 10, a="id")).show()
+ +--------------------------------+
+ |pa_calc(b => (id * 10), a => id)|
+ +--------------------------------+
+ | 0|
+ | 101|
+ +--------------------------------+
+
+ See Also
+ --------
+ :meth:`pyspark.sql.functions.pandas_udf`
+ :meth:`pyspark.sql.functions.arrow_udf`
+
Notes
-----
The user-defined functions are considered deterministic by default. Due to
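
[Editor's note] The requirements listed in the docstring above describe how `@udf` now dispatches on a function's type hints. As a rough illustration of that idea only (not Spark's actual implementation), a minimal classifier might look like the hypothetical helper below, which treats a function as a 'Series to Series' pandas-vectorized UDF when it has at least one argument and all of its hints are `pd.Series`, and falls back to a vanilla UDF otherwise:

```python
import inspect
import pandas as pd

def classify_udf(func):
    """Hypothetical sketch of the dispatch described in the docs: a
    function whose type hints all match pd.Series (with at least one
    argument) is classified as a vectorized 'Series to Series' pandas
    UDF; anything else falls back to a vanilla Python UDF."""
    sig = inspect.signature(func)
    params = list(sig.parameters.values())
    if not params:  # requirement 1: 0-arg functions are never vectorized
        return "vanilla"
    # requirement 2: every argument and the return value must be pd.Series
    args_ok = all(p.annotation is pd.Series for p in params)
    return_ok = sig.return_annotation is pd.Series
    return "pandas_vectorized" if args_ok and return_ok else "vanilla"

def pd_calc(a: pd.Series, b: pd.Series) -> pd.Series:
    return a + 10 * b

def plain_calc(a, b):
    return a + 10 * b

print(classify_udf(pd_calc))     # pandas_vectorized
print(classify_udf(plain_calc))  # vanilla
```

The real dispatch in PySpark also recognizes the Arrow patterns (`pa.Array`) and interacts with `useArrow` and the `spark.sql.execution.pythonUDF.arrow.enabled` configuration, as the doctest examples in the diff show.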
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]