Repository: spark
Updated Branches:
refs/heads/master a9350d709 -> 087fb3142
[SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_udf` with keyword args
## What changes were proposed in this pull request?
Add documentation about the limitations of `pandas_udf` with keyword arguments
and related concepts, like `functools.partial` fn objects.
NOTE: intermediate commits on this PR show some of the steps that can be taken
to fix some (but not all) of these pain points.
### Survey of problems we face today:
(Initialize) Note: Python 3.6 and Spark 2.4 snapshot.
```
from pyspark.sql import SparkSession
import inspect, functools
from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit, udf
spark = SparkSession.builder.getOrCreate()
print(spark.version)
df = spark.range(1,6).withColumn('b', col('id') * 2)
def ok(a,b): return a+b
```
Using a keyword argument at the call site `b=...` (and yes, *full* stack trace
below, haha):
```
---> 14 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id', b='id')).show() # no kwargs
TypeError: wrapper() got an unexpected keyword argument 'b'
```
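One way to read this error, sketched in plain Python with no Spark dependency: if the callable returned by `pandas_udf(...)` is a wrapper that accepts only positional column arguments, any keyword at the call site fails before the user function is reached. `make_column_fn` below is a hypothetical stand-in, not PySpark's implementation; the only thing taken from the source is the `wrapper()` name in the error message.

```python
def make_column_fn(f):
    """Hypothetical stand-in for a UDF factory; NOT PySpark's actual code."""
    def wrapper(*cols):  # accepts positional arguments only: there is no **kwargs
        return f(*cols)
    return wrapper


def ok(a, b):
    return a + b


column_fn = make_column_fn(ok)

# Passing both arguments positionally works.
positional_result = column_fn(1, 2)

# Passing one by keyword fails inside the wrapper, before `ok` is ever called.
try:
    column_fn(1, b=2)
    keyword_error = None
except TypeError as exc:
    keyword_error = str(exc)
```

Under that assumption, the call-site workaround is to pass every column positionally, for example `pandas_udf(f=ok, returnType='bigint')('id', 'id')`.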
Using `functools.partial` with a keyword argument, where the keyword argument is the
first parameter of the function:
*(Aside: kind of interesting that lines 15 and 16 work great and then line 17 explodes.)*
```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-e9f31b8799c1> in <module>()
15 df.withColumn('ok', pandas_udf(f=functools.partial(ok, 7), returnType='bigint')('id')).show()
16 df.withColumn('ok', pandas_udf(f=functools.partial(ok, b=7), returnType='bigint')('id')).show()
---> 17 df.withColumn('ok', pandas_udf(f=functools.partial(ok, a=7), returnType='bigint')('id')).show()
/Users/stu/ZZ/spark/python/pyspark/sql/functions.py in pandas_udf(f, returnType, functionType)
2378 return functools.partial(_create_udf, returnType=return_type, evalType=eval_type)
2379 else:
-> 2380 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
2381
2382
/Users/stu/ZZ/spark/python/pyspark/sql/udf.py in _create_udf(f, returnType, evalType)
54 argspec.varargs is None:
55 raise ValueError(
---> 56 "Invalid function: 0-arg pandas_udfs are not supported. "
57 "Instead, create a 1-arg pandas_udf and ignore the arg in your function."
58 )
ValueError: Invalid function: 0-arg pandas_udfs are not supported. Instead, create a 1-arg pandas_udf and ignore the arg in your function.
```
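The difference between lines 15, 16 and 17 can be reproduced without Spark. The traceback shows `_create_udf` rejecting a function whose argspec has no positional args and no varargs; the sketch below assumes that argspec comes from Python's `inspect` module (`inspect.getfullargspec` here is an assumption about this Spark build, and `positional_args` and `ok_with_a7` are hypothetical helpers). Binding a parameter by keyword in `functools.partial` makes that parameter and every parameter after it keyword-only, so binding the first parameter leaves no positional args at all.

```python
import functools
import inspect


def ok(a, b):
    return a + b


def positional_args(f):
    """Names that `inspect` reports as positional parameters of `f`."""
    return inspect.getfullargspec(f).args


bound_positionally = functools.partial(ok, 7)  # line 15 in the traceback
bound_second_kw = functools.partial(ok, b=7)   # line 16
bound_first_kw = functools.partial(ok, a=7)    # line 17

args_15 = positional_args(bound_positionally)  # 'a' is consumed, 'b' remains
args_16 = positional_args(bound_second_kw)     # 'a' stays positional
args_17 = positional_args(bound_first_kw)      # nothing positional is left


# A possible workaround: an explicit one-argument function instead of a partial.
def ok_with_a7(b):
    return ok(7, b)
```

Note that `functools.partial(ok, a=7)` cannot be called with a positional argument even in plain Python, since that argument would also bind to `a`; an explicit wrapper such as `ok_with_a7` avoids both problems.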
Author: Michael (Stu) Stewart <[email protected]>
Closes #20900 from mstewart141/udfkw2.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/087fb314
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/087fb314
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/087fb314
Branch: refs/heads/master
Commit: 087fb3142028d679524e22596b0ad4f74ff47e8d
Parents: a9350d7
Author: Michael (Stu) Stewart <[email protected]>
Authored: Mon Mar 26 12:45:45 2018 +0900
Committer: hyukjinkwon <[email protected]>
Committed: Mon Mar 26 12:45:45 2018 +0900
----------------------------------------------------------------------
python/pyspark/sql/functions.py | 4 ++++
1 file changed, 4 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/087fb314/python/pyspark/sql/functions.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index a4edb1e..ad3e37c 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2154,6 +2154,8 @@ def udf(f=None, returnType=StringType()):
in boolean expressions and it ends up with being executed all internally. If the functions
can fail on special rows, the workaround is to incorporate the condition into the functions.
+ .. note:: The user-defined functions do not take keyword arguments on the calling side.
+
:param f: python function if used as a standalone function
:param returnType: the return type of the user-defined function. The value can be either a
:class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
@@ -2337,6 +2339,8 @@ def pandas_udf(f=None, returnType=None, functionType=None):
.. note:: The user-defined functions do not support conditional expressions or short circuiting
in boolean expressions and it ends up with being executed all internally. If the functions
can fail on special rows, the workaround is to incorporate the condition into the functions.
+
+ .. note:: The user-defined functions do not take keyword arguments on the calling side.
"""
# decorator @pandas_udf(returnType, functionType)
is_decorator = f is None or isinstance(f, (str, DataType))