This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 08f1d8b6ffe [SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+
08f1d8b6ffe is described below
commit 08f1d8b6ffed2d9a4c0633bd65ac4cef13f5c745
Author: Dongjoon Hyun <[email protected]>
AuthorDate: Mon Nov 20 08:30:42 2023 +0900
[SPARK-45988][SPARK-45989][PYTHON] Fix typehints to handle `list` GenericAlias in Python 3.11+
### What changes were proposed in this pull request?
This PR aims to fix the type hint handling for `list` GenericAlias in Python 3.11+, for Apache Spark 4.0.0 and 3.5.1.
- https://github.com/apache/spark/actions/workflows/build_python.yml
### Why are the changes needed?
Starting with Python 3.11, PEP 646 makes `GenericAlias` instances `Iterable`.
- https://peps.python.org/pep-0646/
This behavior change introduces the following failure on Python 3.11:
- **Python 3.11.6**
```python
Python 3.11.6 (main, Nov  1 2023, 07:46:30) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/18 16:34:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.11.6 (main, Nov  1 2023 07:46:30)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1700354049391).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
>>> from typing import List
>>> ps.DataFrame[float, [int, List[int]]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/frame.py", line 13647, in __class_getitem__
    return create_tuple_for_frame_type(params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 717, in create_tuple_for_frame_type
    return Tuple[_to_type_holders(params)]
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 762, in _to_type_holders
    data_types = _new_type_holders(data_types, NameTypeHolder)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/APACHE/spark-release/spark-3.5.0-bin-hadoop3/python/pyspark/pandas/typedef/typehints.py", line 828, in _new_type_holders
    raise TypeError(
TypeError: Type hints should be specified as one of:
  - DataFrame[type, type, ...]
  - DataFrame[name: type, name: type, ...]
  - DataFrame[dtypes instance]
  - DataFrame[zip(names, types)]
  - DataFrame[index_type, [type, ...]]
  - DataFrame[(index_name, index_type), [(name, type), ...]]
  - DataFrame[dtype instance, dtypes instance]
  - DataFrame[(index_name, index_type), zip(names, types)]
  - DataFrame[[index_type, ...], [type, ...]]
  - DataFrame[[(index_name, index_type), ...], [(name, type), ...]]
  - DataFrame[dtypes instance, dtypes instance]
  - DataFrame[zip(index_names, index_types), zip(names, types)]
However, got (<class 'int'>, typing.List[int]).
```
- **Python 3.10.13**
```python
Python 3.10.13 (main, Sep 29 2023, 16:03:45) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/18 16:33:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.10.13 (main, Sep 29 2023 16:03:45)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1700354002048).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
>>> from typing import List
>>> ps.DataFrame[float, [int, List[int]]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
>>>
```
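The behavior difference between the two transcripts can be reproduced without Spark at all. This is a minimal standard-library check of the PEP 646 change, my own illustration rather than Spark code:

```python
import sys
from collections.abc import Iterable
from typing import List

# Before Python 3.11, parameterized generic aliases such as List[int] do not
# implement __iter__, so they fail an Iterable instance check. PEP 646 added
# __iter__ (to support star-unpacking in subscripts), which flips the result.
alias_is_iterable = isinstance(List[int], Iterable)
print(alias_is_iterable)  # False on <= 3.10, True on 3.11+

# A version-independent statement of the same fact:
assert alias_is_iterable == (sys.version_info >= (3, 11))
```

This is exactly the flip that broke the `is_unnamed_params` detection in `pyspark.pandas`: on 3.11+, `List[int]` suddenly looks like an `Iterable` container of parameters instead of a single bare type.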
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs. Manually tested with Python 3.11.
```
$ build/sbt -Phadoop-3 -Pkinesis-asl -Pyarn -Pkubernetes \
    -Pdocker-integration-tests -Pconnect -Pspark-ganglia-lgpl -Pvolcano \
    -Phadoop-cloud -Phive-thriftserver -Phive Test/package \
    streaming-kinesis-asl-assembly/assembly connect/assembly
$ python/run-tests --modules pyspark-pandas-slow --python-executables python3.11
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43888 from dongjoon-hyun/SPARK-45988.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/pandas/typedef/typehints.py | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/python/pyspark/pandas/typedef/typehints.py b/python/pyspark/pandas/typedef/typehints.py
index 57bfd7fcd83..bb0f70ee924 100644
--- a/python/pyspark/pandas/typedef/typehints.py
+++ b/python/pyspark/pandas/typedef/typehints.py
@@ -796,9 +796,21 @@ def _new_type_holders(
         isinstance(param, slice) and param.step is None and param.stop is not None
         for param in params
     )
-    is_unnamed_params = all(
-        not isinstance(param, slice) and not isinstance(param, Iterable) for param in params
-    )
+    if sys.version_info < (3, 11):
+        is_unnamed_params = all(
+            not isinstance(param, slice) and not isinstance(param, Iterable) for param in params
+        )
+    else:
+        # PEP 646 changes `GenericAlias` instances into iterable ones at Python 3.11
+        is_unnamed_params = all(
+            not isinstance(param, slice)
+            and (
+                not isinstance(param, Iterable)
+                or isinstance(param, typing.GenericAlias)
+                or isinstance(param, typing._GenericAlias)
+            )
+            for param in params
+        )
     if is_named_params:
         # DataFrame["id": int, "A": int]
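For readers outside the Spark tree, the patched predicate can be sketched in isolation. This is an illustrative re-implementation, not the Spark code itself: `is_unnamed_param` is a hypothetical helper name, and it uses the public `types.GenericAlias` in place of the undocumented `typing.GenericAlias` re-export, while leaning on the private `typing._GenericAlias` just as the patch does:

```python
import sys
import types
import typing
from collections.abc import Iterable
from typing import List


def is_unnamed_param(param: object) -> bool:
    """Return True if `param` is a bare type parameter (e.g. `int`, `List[int]`)
    rather than a named one (a slice such as `"id": int`) or a nested container."""
    if sys.version_info < (3, 11):
        # Pre-3.11: generic aliases are not Iterable, so the simple check works.
        return not isinstance(param, slice) and not isinstance(param, Iterable)
    # 3.11+: generic aliases became Iterable (PEP 646), so they must be
    # explicitly re-admitted as unnamed parameters.
    return not isinstance(param, slice) and (
        not isinstance(param, Iterable)
        or isinstance(param, types.GenericAlias)    # e.g. list[int]
        or isinstance(param, typing._GenericAlias)  # e.g. typing.List[int]
    )


print(is_unnamed_param(int))               # True: a bare type
print(is_unnamed_param(List[int]))         # True on 3.10 and 3.11+ alike
print(is_unnamed_param([int, List[int]]))  # False: a genuine Iterable container
```

The design point is that the 3.11+ branch does not simply drop the `Iterable` check: real containers like `[int, List[int]]` must still be treated as nested parameter lists, so only generic aliases are exempted from it.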