This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new cd19d6c299a7 [SPARK-44733][PYTHON][DOCS] Add Python to Spark type conversion page to PySpark docs.
cd19d6c299a7 is described below
commit cd19d6c299a7bc6d8a785208654dff132ca5fe1b
Author: Phil Dakin <[email protected]>
AuthorDate: Tue Nov 14 11:25:57 2023 +0900
[SPARK-44733][PYTHON][DOCS] Add Python to Spark type conversion page to PySpark docs.
### What changes were proposed in this pull request?
Add documentation page showing Python to Spark type mappings for PySpark.
### Why are the changes needed?
Surface this information to users navigating the PySpark docs per https://issues.apache.org/jira/browse/SPARK-44733.
### Does this PR introduce _any_ user-facing change?
Yes, adds new page to PySpark docs.
### How was this patch tested?
Build HTML docs file using Sphinx, inspect visually.
### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43369 from PhilDakin/20231013.SPARK-44733.
Authored-by: Phil Dakin <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
docs/sql-ref-datatypes.md | 8 +-
python/docs/source/user_guide/sql/index.rst | 1 +
.../source/user_guide/sql/type_conversions.rst | 248 +++++++++++++++++++++
3 files changed, 253 insertions(+), 4 deletions(-)
diff --git a/docs/sql-ref-datatypes.md b/docs/sql-ref-datatypes.md
index 25dc00f18a4e..041d22baf659 100644
--- a/docs/sql-ref-datatypes.md
+++ b/docs/sql-ref-datatypes.md
@@ -119,10 +119,10 @@ from pyspark.sql.types import *
|Data type|Value type in Python|API to access or create a data type|
|---------|--------------------|-----------------------------------|
-|**ByteType**|int or long<br/>**Note:** Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127.|ByteType()|
-|**ShortType**|int or long<br/>**Note:** Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767.|ShortType()|
-|**IntegerType**|int or long|IntegerType()|
-|**LongType**|long<br/>**Note:** Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType.|LongType()|
+|**ByteType**|int<br/>**Note:** Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127.|ByteType()|
+|**ShortType**|int<br/>**Note:** Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767.|ShortType()|
+|**IntegerType**|int|IntegerType()|
+|**LongType**|int<br/>**Note:** Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType.|LongType()|
|**FloatType**|float<br/>**Note:** Numbers will be converted to 4-byte single-precision floating point numbers at runtime.|FloatType()|
|**DoubleType**|float|DoubleType()|
|**DecimalType**|decimal.Decimal|DecimalType()|
diff --git a/python/docs/source/user_guide/sql/index.rst b/python/docs/source/user_guide/sql/index.rst
index c0369de67865..118cf139d9b3 100644
--- a/python/docs/source/user_guide/sql/index.rst
+++ b/python/docs/source/user_guide/sql/index.rst
@@ -25,4 +25,5 @@ Spark SQL
arrow_pandas
python_udtf
+ type_conversions
diff --git a/python/docs/source/user_guide/sql/type_conversions.rst b/python/docs/source/user_guide/sql/type_conversions.rst
new file mode 100644
index 000000000000..b63e7dfa8851
--- /dev/null
+++ b/python/docs/source/user_guide/sql/type_conversions.rst
@@ -0,0 +1,248 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+================================
+Python to Spark Type Conversions
+================================
+
+.. TODO: Add additional information on conversions when Arrow is enabled.
+.. TODO: Add in-depth explanation and table for type conversions (SPARK-44734).
+
+.. currentmodule:: pyspark.sql.types
+
+When working with PySpark, you will often need to consider the conversions between Python-native
+objects and their Spark equivalents. For instance, when working with user-defined functions, the
+function's return value will be cast by Spark to an appropriate Spark SQL type. Or, when creating a
+``DataFrame``, you may supply ``numpy`` or ``pandas`` objects as the input data. This guide covers
+the various conversions between Python and Spark SQL types.
+
+Browsing Type Conversions
+-------------------------
+
+Though this document provides a comprehensive list of type conversions, you may find it easier to
+interactively check the conversion behavior of Spark. To do so, you can test small examples of
+user-defined functions, and use the ``spark.createDataFrame`` interface.
+
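+For example, you can check how Spark infers a handful of Python-native values. The snippet below is
+a minimal sketch, assuming an active ``SparkSession`` named ``spark``:
+
+.. code-block:: python
+
+    import datetime
+
+    # Let Spark infer the types of a few Python-native values.
+    df = spark.createDataFrame([(1, 2.0, "a", datetime.date(2023, 1, 1))])
+    df.printSchema()
+    # root
+    # |-- _1: long (nullable = true)
+    # |-- _2: double (nullable = true)
+    # |-- _3: string (nullable = true)
+    # |-- _4: date (nullable = true)
+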
+All data types of Spark SQL are located in the package ``pyspark.sql.types``.
+You can access them by doing:
+
+.. code-block:: python
+
+ from pyspark.sql.types import *
+
+Configuration
+-------------
+There are several configurations that affect the behavior of type conversions. These configurations
+are listed below:
+
+.. list-table::
+ :header-rows: 1
+
+ * - Configuration
+ - Description
+ - Default
+ * - spark.sql.execution.pythonUDF.arrow.enabled
+ - Enable PyArrow in PySpark. See more `here <arrow_pandas.rst>`_.
+ - False
+ * - spark.sql.pyspark.inferNestedDictAsStruct.enabled
+ - When enabled, nested dictionaries are inferred as StructType. Otherwise, they are inferred as MapType.
+ - False
+ * - spark.sql.timestampType
+ - If set to `TIMESTAMP_NTZ`, the default timestamp type is ``TimestampNTZType``. Otherwise, the default timestamp type is ``TimestampType``.
+ - ""
+
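+For instance, ``spark.sql.timestampType`` changes how naive ``datetime.datetime`` values are
+inferred. A hedged sketch, assuming an active ``SparkSession`` named ``spark``:
+
+.. code-block:: python
+
+    import datetime
+
+    # By default, datetime.datetime values are inferred as TimestampType.
+    spark.createDataFrame([(datetime.datetime(2023, 1, 1),)]).printSchema()
+    # root
+    # |-- _1: timestamp (nullable = true)
+
+    # With TIMESTAMP_NTZ, the default timestamp type is TimestampNTZType.
+    spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
+    spark.createDataFrame([(datetime.datetime(2023, 1, 1),)]).printSchema()
+    # root
+    # |-- _1: timestamp_ntz (nullable = true)
+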
+All Conversions
+---------------
+.. list-table::
+ :header-rows: 1
+
+ * - Data type
+ - Value type in Python
+ - API to access or create a data type
+ * - **ByteType**
+ - int
+ .. note:: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127.
+ - ByteType()
+ * - **ShortType**
+ - int
+ .. note:: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767.
+ - ShortType()
+ * - **IntegerType**
+ - int
+ - IntegerType()
+ * - **LongType**
+ - int
+ .. note:: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType.
+ - LongType()
+ * - **FloatType**
+ - float
+ .. note:: Numbers will be converted to 4-byte single-precision floating point numbers at runtime.
+ - FloatType()
+ * - **DoubleType**
+ - float
+ - DoubleType()
+ * - **DecimalType**
+ - decimal.Decimal
+ - DecimalType()
+ * - **StringType**
+ - string
+ - StringType()
+ * - **BinaryType**
+ - bytearray
+ - BinaryType()
+ * - **BooleanType**
+ - bool
+ - BooleanType()
+ * - **TimestampType**
+ - datetime.datetime
+ - TimestampType()
+ * - **TimestampNTZType**
+ - datetime.datetime
+ - TimestampNTZType()
+ * - **DateType**
+ - datetime.date
+ - DateType()
+ * - **DayTimeIntervalType**
+ - datetime.timedelta
+ - DayTimeIntervalType()
+ * - **ArrayType**
+ - list, tuple, or array
+ - ArrayType(*elementType*, [*containsNull*])
+ .. note:: The default value of *containsNull* is True.
+ * - **MapType**
+ - dict
+ - MapType(*keyType*, *valueType*, [*valueContainsNull*])
+ .. note:: The default value of *valueContainsNull* is True.
+ * - **StructType**
+ - list or tuple
+ - StructType(*fields*)
+ .. note:: *fields* is a list of StructFields. Also, two fields with the same name are not allowed.
+ * - **StructField**
+ - The value type in Python of the data type of this field. For example, int for a StructField with the data type IntegerType.
+ - StructField(*name*, *dataType*, [*nullable*])
+ .. note:: The default value of *nullable* is True.
+
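+As a quick illustration of the table above, the following sketch pairs Python values with an
+explicit schema built from these types (assuming an active ``SparkSession`` named ``spark``):
+
+.. code-block:: python
+
+    import datetime
+    from decimal import Decimal
+    from pyspark.sql.types import (
+        StructType,
+        StructField,
+        ByteType,
+        StringType,
+        DecimalType,
+        DateType,
+    )
+
+    schema = StructType([
+        StructField("b", ByteType()),    # int, within -128 to 127
+        StructField("s", StringType()),  # str
+        StructField("d", DecimalType()), # decimal.Decimal
+        StructField("dt", DateType()),   # datetime.date
+    ])
+    spark.createDataFrame(
+        [(1, "a", Decimal(1), datetime.date(2023, 1, 1))], schema
+    ).printSchema()
+    # root
+    # |-- b: byte (nullable = true)
+    # |-- s: string (nullable = true)
+    # |-- d: decimal(10,0) (nullable = true)
+    # |-- dt: date (nullable = true)
+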
+Conversions in Practice - UDFs
+------------------------------
+A common conversion case is returning a Python value from a UDF. In this case, the value returned by
+the UDF must match the return type provided to the UDF.
+
+.. note:: If the actual return value of your function does not match the provided return type, Spark will implicitly replace the value with null.
+
+.. code-block:: python
+
+ from pyspark.sql.types import (
+ StructType,
+ StructField,
+ IntegerType,
+ StringType,
+ FloatType,
+ )
+ from pyspark.sql.functions import udf, col
+
+ df = spark.createDataFrame(
+ [[1]], schema=StructType([StructField("int", IntegerType())])
+ )
+
+ @udf(returnType=StringType())
+ def to_string(value):
+ return str(value)
+
+ @udf(returnType=FloatType())
+ def to_float(value):
+ return float(value)
+
+ df.withColumn("cast_int", to_float(col("int"))).withColumn(
+ "cast_str", to_string(col("int"))
+ ).printSchema()
+ # root
+ # |-- int: integer (nullable = true)
+ # |-- cast_int: float (nullable = true)
+ # |-- cast_str: string (nullable = true)
+
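+To see the null behavior described in the note above, here is a minimal sketch reusing the ``df``,
+``udf``, and ``col`` names from the previous example (``bad_cast`` is a hypothetical name):
+
+.. code-block:: python
+
+    from pyspark.sql.types import IntegerType
+
+    @udf(returnType=IntegerType())
+    def bad_cast(value):
+        # Returns a str, which does not match the declared IntegerType.
+        return str(value)
+
+    df.select(bad_cast(col("int"))).collect()
+    # [Row(bad_cast(int)=None)]
+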
+Conversions in Practice - Creating DataFrames
+---------------------------------------------
+Another common conversion case is when creating a DataFrame from values in Python. In this case,
+you can supply a schema, or allow Spark to infer the schema from the provided data.
+
+.. code-block:: python
+
+ data = [
+ ["Wei", "Math", 93.0, 1],
+ ["Jerry", "Physics", 85.0, 4],
+ ["Katrina", "Geology", 90.0, 2],
+ ]
+ cols = ["Name", "Subject", "Score", "Period"]
+
+ spark.createDataFrame(data, cols).printSchema()
+ # root
+ # |-- Name: string (nullable = true)
+ # |-- Subject: string (nullable = true)
+ # |-- Score: double (nullable = true)
+ # |-- Period: long (nullable = true)
+
+ import pandas as pd
+
+ df = pd.DataFrame(data, columns=cols)
+ spark.createDataFrame(df).printSchema()
+ # root
+ # |-- Name: string (nullable = true)
+ # |-- Subject: string (nullable = true)
+ # |-- Score: double (nullable = true)
+ # |-- Period: long (nullable = true)
+
+ import numpy as np
+
+ spark.createDataFrame(np.zeros([3, 2], "int8")).printSchema()
+ # root
+ # |-- _1: byte (nullable = true)
+ # |-- _2: byte (nullable = true)
+
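+You can also make the schema explicit instead of relying on inference; for example, a hedged sketch
+reusing ``data`` from above with a DDL-formatted schema string:
+
+.. code-block:: python
+
+    # TINYINT maps to ByteType, FLOAT to FloatType.
+    schema = "Name STRING, Subject STRING, Score FLOAT, Period TINYINT"
+    spark.createDataFrame(data, schema).printSchema()
+    # root
+    # |-- Name: string (nullable = true)
+    # |-- Subject: string (nullable = true)
+    # |-- Score: float (nullable = true)
+    # |-- Period: byte (nullable = true)
+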
+Conversions in Practice - Nested Data Types
+-------------------------------------------
+Nested data types will convert to ``StructType``, ``MapType``, and ``ArrayType``, depending on the passed data.
+
+.. code-block:: python
+
+ data = [
+ ["Wei", [[1, 2]], {"RecordType": "Scores", "Math": { "H1": 93.0, "H2":
85.0}}],
+ ]
+ cols = ["Name", "ActiveHalfs", "Record"]
+
+ spark.createDataFrame(data, cols).printSchema()
+ # root
+ # |-- Name: string (nullable = true)
+ # |-- ActiveHalfs: array (nullable = true)
+ # | |-- element: array (containsNull = true)
+ # | | |-- element: long (containsNull = true)
+ # |-- Record: map (nullable = true)
+ # | |-- key: string
+ # | |-- value: string (valueContainsNull = true)
+
+ spark.conf.set('spark.sql.pyspark.inferNestedDictAsStruct.enabled', True)
+
+ spark.createDataFrame(data, cols).printSchema()
+ # root
+ # |-- Name: string (nullable = true)
+ # |-- ActiveHalfs: array (nullable = true)
+ # | |-- element: array (containsNull = true)
+ # | | |-- element: long (containsNull = true)
+ # |-- Record: struct (nullable = true)
+ # | |-- RecordType: string (nullable = true)
+ # | |-- Math: struct (nullable = true)
+ # | | |-- H1: double (nullable = true)
+ # | | |-- H2: double (nullable = true)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]