This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 36da27fae56a [SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed`
36da27fae56a is described below
commit 36da27fae56aae4132ee1a2d646d0e4249e7d7e3
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Fri Jan 19 14:11:44 2024 +0800
[SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed`
### What changes were proposed in this pull request?
Make `shuffle` specify the datatype of `seed`
### Why are the changes needed?
The `shuffle` function may fail with an extremely low probability (~2e-10):
`shuffle` requires a Long-typed `seed` when invoked as an unregistered
function, and this Long value is extracted in the Planner.
In the Scala client, `SparkClassUtils.random.nextLong` guarantees the type,
while in Python, `lit(random.randint(0, sys.maxsize))` may return an Integer
literal instead of a Long literal.
```
In [26]: from pyspark.sql import functions as sf
In [27]: df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
In [28]: df.select(sf.shuffle(df.data)).show()
+-------------+
|shuffle(data)|
+-------------+
|[1, 3, 5, 20]|
+-------------+
In [29]: df.select(sf.call_udf("shuffle", df.data, sf.lit(123456789000000))).show()
+-------------+
|shuffle(data)|
+-------------+
|[20, 1, 5, 3]|
+-------------+
In [30]: df.select(sf.call_udf("shuffle", df.data, sf.lit(12345))).show()
...
SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) seed should be a literal long, but got 12345
```
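The quoted ~2e-10 odds can be sketched in plain Python. This is a back-of-the-envelope check, not part of the patch, and it assumes 64-bit CPython where `sys.maxsize == 2**63 - 1`:

```python
import sys

# random.randint(0, sys.maxsize) draws uniformly from [0, 2**63 - 1].
# The seed is only inferred as an Integer literal (instead of a Long)
# when it happens to fit in a 32-bit signed int, i.e. is <= 2**31 - 1.
INT32_MAX = 2**31 - 1

# Probability of drawing such a small seed: 2**31 / 2**63 == 2**-32.
p = (INT32_MAX + 1) / (sys.maxsize + 1)
print(f"{p:.2e}")  # roughly 2.33e-10
```

Pinning the seed to `LongType` in the literal, as the diff below does, removes this failure mode entirely rather than leaving it to type inference.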
Another case is `uuid`, but it is not supported in Python due to namespace conflicts.
I did not find other similar cases.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Manually checked.
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #44793 from zhengruifeng/py_shuffle_long.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
python/pyspark/sql/connect/functions/builtin.py | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/python/pyspark/sql/connect/functions/builtin.py b/python/pyspark/sql/connect/functions/builtin.py
index 1e22a42c6241..7276cead88ef 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -59,7 +59,14 @@ from pyspark.sql.connect.udf import _create_py_udf
from pyspark.sql.connect.udtf import AnalyzeArgument, AnalyzeResult # noqa: F401
from pyspark.sql.connect.udtf import _create_py_udtf
from pyspark.sql import functions as pysparkfuncs
-from pyspark.sql.types import _from_numpy_type, DataType, StructType, ArrayType, StringType
+from pyspark.sql.types import (
+ _from_numpy_type,
+ DataType,
+ LongType,
+ StructType,
+ ArrayType,
+ StringType,
+)
# The implementation of pandas_udf is embedded in pyspark.sql.function.pandas_udf
# for code reuse.
@@ -2116,7 +2123,11 @@ schema_of_xml.__doc__ = pysparkfuncs.schema_of_xml.__doc__
def shuffle(col: "ColumnOrName") -> Column:
- return _invoke_function("shuffle", _to_col(col), lit(random.randint(0,
sys.maxsize)))
+ return _invoke_function(
+ "shuffle",
+ _to_col(col),
+ LiteralExpression(random.randint(0, sys.maxsize), LongType()),
+ )
shuffle.__doc__ = pysparkfuncs.shuffle.__doc__
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]