This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 00928f458c09 [SPARK-53716][PYTHON][DOCS] Document vectorized UDF with `@udf`
00928f458c09 is described below
commit 00928f458c09e3446e215128bd59977aca00b932
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Thu Sep 25 16:34:25 2025 -0700
[SPARK-53716][PYTHON][DOCS] Document vectorized UDF with `@udf`
### What changes were proposed in this pull request?
Document vectorized UDF with `udf`
### Why are the changes needed?
to document this new feature
### Does this PR introduce _any_ user-facing change?
yes, doc-only change
### How was this patch tested?
ci, doctest
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #52454 from zhengruifeng/doc_udf_vec.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
python/pyspark/sql/functions/builtin.py | 59 +++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py
index 68753ba3351d..1df8ddd1f432 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -27515,6 +27515,10 @@ def udf(
----------
f : function, optional
python function if used as a standalone function
+
+ .. versionchanged:: 4.1.0
+ Supports vectorized functions by specifying type hints.
+
returnType : :class:`pyspark.sql.types.DataType` or str, optional
the return type of the user-defined function. The value can be either a
:class:`pyspark.sql.types.DataType` object or a DDL-formatted type
string.
@@ -27559,6 +27563,61 @@ def udf(
| 101|
+-----------------------------+
+ Support for vectorized functions by specifying type hints.
+
+ To define a vectorized function, the function should meet the following requirements:
+
+ 1. have at least one argument (0-argument functions are not supported);
+
+ 2. the type hints should match one of the supported pandas UDF or Arrow UDF patterns;
+
+ 3. the argument `useArrow` should not be explicitly set.
+
+ If a function does not meet these requirements, it will be treated as a vanilla Python UDF or an Arrow-optimized Python UDF (depending on the argument `useArrow`, the configuration `spark.sql.execution.pythonUDF.arrow.enabled`, and the installed dependencies).
+
+ For example, define a 'Series to Series' pandas UDF.
+
+ >>> from pyspark.sql.functions import udf, PandasUDFType
+ >>> import pandas as pd
+ >>> @udf(returnType=IntegerType())
+ ... def pd_calc(a: pd.Series, b: pd.Series) -> pd.Series:
+ ... return a + 10 * b
+ ...
+ >>> pd_calc.evalType == PandasUDFType.SCALAR
+ True
+ >>> spark.range(2).select(pd_calc(b=col("id") * 10, a="id")).show()
+ +--------------------------------+
+ |pd_calc(b => (id * 10), a => id)|
+ +--------------------------------+
+ | 0|
+ | 101|
+ +--------------------------------+
+
+ As another example, define an 'Array to Array' Arrow UDF.
+
+ >>> from pyspark.sql.functions import udf, ArrowUDFType
+ >>> import pyarrow as pa
+ >>> @udf(returnType=IntegerType())
+ ... def pa_calc(a: pa.Array, b: pa.Array) -> pa.Array:
+ ... return pa.compute.add(a, pa.compute.multiply(b, 10))
+ ...
+ >>> pa_calc.evalType == ArrowUDFType.SCALAR
+ True
+ >>> spark.range(2).select(pa_calc(b=col("id") * 10, a="id")).show()
+ +--------------------------------+
+ |pa_calc(b => (id * 10), a => id)|
+ +--------------------------------+
+ | 0|
+ | 101|
+ +--------------------------------+
+
+ See Also
+ --------
+ :meth:`pyspark.sql.functions.pandas_udf`
+ :meth:`pyspark.sql.functions.arrow_udf`
+
Notes
-----
The user-defined functions are considered deterministic by default. Due to
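
[Editor's note] The requirements listed in the docstring above describe how `@udf` now dispatches on a function's type hints. As a rough illustration of that idea only (not Spark's actual implementation), a minimal classifier might look like the hypothetical helper below, which treats a function as a 'Series to Series' pandas-vectorized UDF when it has at least one argument and all of its hints are `pd.Series`, and falls back to a vanilla UDF otherwise:

```python
import inspect
import pandas as pd

def classify_udf(func):
    """Hypothetical sketch of the dispatch described in the docs: a
    function whose type hints all match pd.Series (with at least one
    argument) is classified as a vectorized 'Series to Series' pandas
    UDF; anything else falls back to a vanilla Python UDF."""
    sig = inspect.signature(func)
    params = list(sig.parameters.values())
    if not params:  # requirement 1: 0-arg functions are never vectorized
        return "vanilla"
    # requirement 2: every argument and the return value must be pd.Series
    args_ok = all(p.annotation is pd.Series for p in params)
    return_ok = sig.return_annotation is pd.Series
    return "pandas_vectorized" if args_ok and return_ok else "vanilla"

def pd_calc(a: pd.Series, b: pd.Series) -> pd.Series:
    return a + 10 * b

def plain_calc(a, b):
    return a + 10 * b

print(classify_udf(pd_calc))     # pandas_vectorized
print(classify_udf(plain_calc))  # vanilla
```

The real dispatch in PySpark also recognizes the Arrow patterns (`pa.Array`) and interacts with `useArrow` and the `spark.sql.execution.pythonUDF.arrow.enabled` configuration, as the doctest examples in the diff show.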
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]