This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 36da27fae56a [SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed`
36da27fae56a is described below
commit 36da27fae56aae4132ee1a2d646d0e4249e7d7e3
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Fri Jan 19 14:11:44 2024 +0800
[SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed`
### What changes were proposed in this pull request?
Make `shuffle` specify the datatype of `seed`
### Why are the changes needed?
The `shuffle` function may fail with an extremely low probability (~2e-10):
`shuffle` requires a Long-typed `seed` when invoked as an unregistered
function, and this Long value is extracted in the Planner.
In the Scala client, `SparkClassUtils.random.nextLong` guarantees the type,
while in Python, `lit(random.randint(0, sys.maxsize))` may return an Integer
literal instead of a Long literal.
```
In [26]: from pyspark.sql import functions as sf
In [27]: df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
In [28]: df.select(sf.shuffle(df.data)).show()
+-------------+
|shuffle(data)|
+-------------+
|[1, 3, 5, 20]|
+-------------+
In [29]: df.select(sf.call_udf("shuffle", df.data, sf.lit(123456789000000))).show()
+-------------+
|shuffle(data)|
+-------------+
|[20, 1, 5, 3]|
+-------------+
In [30]: df.select(sf.call_udf("shuffle", df.data, sf.lit(12345))).show()
...
SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) seed should be a literal long, but got 12345
```
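The quoted ~2e-10 odds can be sketched in plain Python. This is a back-of-the-envelope check, not part of the patch, and it assumes 64-bit CPython where `sys.maxsize == 2**63 - 1`:

```python
import sys

# random.randint(0, sys.maxsize) draws uniformly from [0, 2**63 - 1].
# The seed is only inferred as an Integer literal (instead of a Long)
# when it happens to fit in a 32-bit signed int, i.e. is <= 2**31 - 1.
INT32_MAX = 2**31 - 1

# Probability of drawing such a small seed: 2**31 / 2**63 == 2**-32.
p = (INT32_MAX + 1) / (sys.maxsize + 1)
print(f"{p:.2e}")  # roughly 2.33e-10
```

Pinning the seed to `LongType` in the literal, as the diff below does, removes this failure mode entirely rather than leaving it to type inference.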
Another case is `uuid`, but it is not supported in Python due to namespace conflicts.
I did not find other similar cases.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Manually checked.
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #44793 from zhengruifeng/py_shuffle_long.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
python/pyspark/sql/connect/functions/builtin.py | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/python/pyspark/sql/connect/functions/builtin.py b/python/pyspark/sql/connect/functions/builtin.py
index 1e22a42c6241..7276cead88ef 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -59,7 +59,14 @@ from pyspark.sql.connect.udf import _create_py_udf
from pyspark.sql.connect.udtf import AnalyzeArgument, AnalyzeResult # noqa: F401
from pyspark.sql.connect.udtf import _create_py_udtf
from pyspark.sql import functions as pysparkfuncs
-from pyspark.sql.types import _from_numpy_type, DataType, StructType, ArrayType, StringType
+from pyspark.sql.types import (
+ _from_numpy_type,
+ DataType,
+ LongType,
+ StructType,
+ ArrayType,
+ StringType,
+)
# The implementation of pandas_udf is embedded in pyspark.sql.function.pandas_udf
# for code reuse.
@@ -2116,7 +2123,11 @@ schema_of_xml.__doc__ = pysparkfuncs.schema_of_xml.__doc__
def shuffle(col: "ColumnOrName") -> Column:
- return _invoke_function("shuffle", _to_col(col), lit(random.randint(0,
sys.maxsize)))
+ return _invoke_function(
+ "shuffle",
+ _to_col(col),
+ LiteralExpression(random.randint(0, sys.maxsize), LongType()),
+ )
shuffle.__doc__ = pysparkfuncs.shuffle.__doc__
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]