This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 08f1d8b6ffe [SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+
08f1d8b6ffe is described below
commit 08f1d8b6ffed2d9a4c0633bd65ac4cef13f5c745
Author: Dongjoon Hyun <[email protected]>
AuthorDate: Mon Nov 20 08:30:42 2023 +0900
[SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+
### What changes were proposed in this pull request?
This PR aims to fix the type hint handling for `list` GenericAlias in Python 3.11+, for Apache Spark 4.0.0 and 3.5.1.
- https://github.com/apache/spark/actions/workflows/build_python.yml
### Why are the changes needed?
Starting with Python 3.11, PEP 646 makes `GenericAlias` instances `Iterable`.
- https://peps.python.org/pep-0646/
This behavior change introduces the following failure on Python 3.11:
- **Python 3.11.6**
```python
Python 3.11.6 (main, Nov  1 2023, 07:46:30) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/18 16:34:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.11.6 (main, Nov  1 2023 07:46:30)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1700354049391).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
>>> from typing import List
>>> ps.DataFrame[float, [int, List[int]]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/frame.py", line 13647, in __class_getitem__
    return create_tuple_for_frame_type(params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 717, in create_tuple_for_frame_type
    return Tuple[_to_type_holders(params)]
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 762, in _to_type_holders
    data_types = _new_type_holders(data_types, NameTypeHolder)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 828, in _new_type_holders
    raise TypeError(
TypeError: Type hints should be specified as one of:
  - DataFrame[type, type, ...]
  - DataFrame[name: type, name: type, ...]
  - DataFrame[dtypes instance]
  - DataFrame[zip(names, types)]
  - DataFrame[index_type, [type, ...]]
  - DataFrame[(index_name, index_type), [(name, type), ...]]
  - DataFrame[dtype instance, dtypes instance]
  - DataFrame[(index_name, index_type), zip(names, types)]
  - DataFrame[[index_type, ...], [type, ...]]
  - DataFrame[[(index_name, index_type), ...], [(name, type), ...]]
  - DataFrame[dtypes instance, dtypes instance]
  - DataFrame[zip(index_names, index_types), zip(names, types)]
However, got (<class 'int'>, typing.List[int]).
```
- **Python 3.10.13**
```python
Python 3.10.13 (main, Sep 29 2023, 16:03:45) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/18 16:33:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.10.13 (main, Sep 29 2023 16:03:45)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1700354002048).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
>>> from typing import List
>>> ps.DataFrame[float, [int, List[int]]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
>>>
```
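The behavior difference between the two transcripts can be reproduced without Spark at all. This is a minimal standard-library check of the PEP 646 change, my own illustration rather than Spark code:

```python
import sys
from collections.abc import Iterable
from typing import List

# Before Python 3.11, parameterized generic aliases such as List[int] do not
# implement __iter__, so they fail an Iterable instance check. PEP 646 added
# __iter__ (to support star-unpacking in subscripts), which flips the result.
alias_is_iterable = isinstance(List[int], Iterable)
print(alias_is_iterable)  # False on <= 3.10, True on 3.11+

# A version-independent statement of the same fact:
assert alias_is_iterable == (sys.version_info >= (3, 11))
```

This is exactly the flip that broke the `is_unnamed_params` detection in `pyspark.pandas`: on 3.11+, `List[int]` suddenly looks like an `Iterable` container of parameters instead of a single bare type.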
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs. Manually tested with Python 3.11.
```
$ build/sbt -Phadoop-3 -Pkinesis-asl -Pyarn -Pkubernetes \
    -Pdocker-integration-tests -Pconnect -Pspark-ganglia-lgpl -Pvolcano \
    -Phadoop-cloud -Phive-thriftserver -Phive Test/package \
    streaming-kinesis-asl-assembly/assembly connect/assembly
$ python/run-tests --modules pyspark-pandas-slow --python-executables python3.11
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43888 from dongjoon-hyun/SPARK-45988.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/pandas/typedef/typehints.py | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/python/pyspark/pandas/typedef/typehints.py b/python/pyspark/pandas/typedef/typehints.py
index 57bfd7fcd83..bb0f70ee924 100644
--- a/python/pyspark/pandas/typedef/typehints.py
+++ b/python/pyspark/pandas/typedef/typehints.py
@@ -796,9 +796,21 @@ def _new_type_holders(
         isinstance(param, slice) and param.step is None and param.stop is not None
         for param in params
     )
-    is_unnamed_params = all(
-        not isinstance(param, slice) and not isinstance(param, Iterable) for param in params
-    )
+    if sys.version_info < (3, 11):
+        is_unnamed_params = all(
+            not isinstance(param, slice) and not isinstance(param, Iterable) for param in params
+        )
+    else:
+        # PEP 646 changes `GenericAlias` instances into iterable ones at Python 3.11
+        is_unnamed_params = all(
+            not isinstance(param, slice)
+            and (
+                not isinstance(param, Iterable)
+                or isinstance(param, typing.GenericAlias)
+                or isinstance(param, typing._GenericAlias)
+            )
+            for param in params
+        )
     if is_named_params:
         # DataFrame["id": int, "A": int]
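For readers outside the Spark tree, the patched predicate can be sketched in isolation. This is an illustrative re-implementation, not the Spark code itself: `is_unnamed_param` is a hypothetical helper name, and it uses the public `types.GenericAlias` in place of the undocumented `typing.GenericAlias` re-export, while leaning on the private `typing._GenericAlias` just as the patch does:

```python
import sys
import types
import typing
from collections.abc import Iterable
from typing import List


def is_unnamed_param(param: object) -> bool:
    """Return True if `param` is a bare type parameter (e.g. `int`, `List[int]`)
    rather than a named one (a slice such as `"id": int`) or a nested container."""
    if sys.version_info < (3, 11):
        # Pre-3.11: generic aliases are not Iterable, so the simple check works.
        return not isinstance(param, slice) and not isinstance(param, Iterable)
    # 3.11+: generic aliases became Iterable (PEP 646), so they must be
    # explicitly re-admitted as unnamed parameters.
    return not isinstance(param, slice) and (
        not isinstance(param, Iterable)
        or isinstance(param, types.GenericAlias)    # e.g. list[int]
        or isinstance(param, typing._GenericAlias)  # e.g. typing.List[int]
    )


print(is_unnamed_param(int))               # True: a bare type
print(is_unnamed_param(List[int]))         # True on 3.10 and 3.11+ alike
print(is_unnamed_param([int, List[int]]))  # False: a genuine Iterable container
```

The design point is that the 3.11+ branch does not simply drop the `Iterable` check: real containers like `[int, List[int]]` must still be treated as nested parameter lists, so only generic aliases are exempted from it.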