This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 9fae65bb220e [SPARK-53959][CONNECT] Throw a client-side error when
creating a dataframe from a pandas dataframe with an index but no data
9fae65bb220e is described below
commit 9fae65bb220e77f5d3de633a84ae2ed6af270f1d
Author: Alex Khakhlyuk <[email protected]>
AuthorDate: Fri Oct 31 13:18:57 2025 +0900
[SPARK-53959][CONNECT] Throw a client-side error when creating a dataframe
from a pandas dataframe with an index but no data
### What changes were proposed in this pull request?
The Spark Connect Python client does not throw a proper error when creating a
dataframe from a pandas dataframe with an index and empty data.
Generally, the Spark Connect client throws a client-side error
`[CANNOT_INFER_EMPTY_SCHEMA] Can not infer schema from an empty dataset` when
creating a dataframe without data, for example via
```
spark.createDataFrame([]).show()
```
or
```
df = pd.DataFrame()
spark.createDataFrame(df).show()
```
or
```
df = pd.DataFrame({"a": []})
spark.createDataFrame(df).show()
```
This does not happen when pandas dataframe has an index but no data, e.g.
```
df = pd.DataFrame(index=range(5))
spark.createDataFrame(df).show()
```
What happens instead is that the dataframe is converted to a LocalRelation on
the client and sent to the server, but the server then throws the following
exception: `INTERNAL_ERROR: Input data for LocalRelation does not produce a
schema. SQLSTATE: XX000`. XX000 is an internal-error SQLSTATE, and the error is
not actionable enough for the user.
This PR fixes this problem by throwing `CANNOT_INFER_EMPTY_SCHEMA` when the
dataframe has rows (because of the index) but no columns.
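The failing case can be reproduced with plain pandas. A minimal sketch (not the actual Spark source) of the condition the new client-side check targets: a pandas dataframe built from an index alone reports rows but exposes no columns, so there is nothing to infer a schema from.

```python
import pandas as pd

# A dataframe constructed only from an index: it has rows but no columns.
df = pd.DataFrame(index=range(5))

print(len(df))          # rows contributed by the index: 5
print(len(df.columns))  # columns available for schema inference: 0
```

With zero columns, the client can now reject the input up front instead of shipping a column-less LocalRelation to the server.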
### Why are the changes needed?
Currently the error is thrown as a server-side internal error and the error
message is not actionable enough for the user.
### Does this PR introduce _any_ user-facing change?
Creating a Spark Connect dataframe from a pandas dataframe with an index
but no data will now throw a client-side error `[CANNOT_INFER_EMPTY_SCHEMA] Can
not infer schema from an empty dataset` instead of the server-side
`INTERNAL_ERROR: Input data for LocalRelation does not produce a schema.
SQLSTATE: XX000`.
### How was this patch tested?
New unit test.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #52670 from khakhlyuk/pd-empty-data.
Authored-by: Alex Khakhlyuk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/sql/connect/session.py | 5 +++++
.../sql/tests/connect/test_connect_creation.py | 19 +++++++++++++++----
2 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py
index 2a678c95c925..21a7c8329a35 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -559,6 +559,11 @@ class SparkSession:
# If no schema supplied by user then get the names of columns only
if schema is None:
_cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+ if len(_cols) == 0:
+ raise PySparkValueError(
+ errorClass="CANNOT_INFER_EMPTY_SCHEMA",
+ messageParameters={},
+ )
infer_pandas_dict_as_map = (
configs["spark.sql.execution.pandas.inferPandasDictAsMap"]
== "true"
)
diff --git a/python/pyspark/sql/tests/connect/test_connect_creation.py b/python/pyspark/sql/tests/connect/test_connect_creation.py
index 917320d354e2..7be9959fdcb4 100644
--- a/python/pyspark/sql/tests/connect/test_connect_creation.py
+++ b/python/pyspark/sql/tests/connect/test_connect_creation.py
@@ -54,10 +54,21 @@ class SparkConnectCreationTests(ReusedMixedTestCase, PandasOnSparkTestUtils):
self.assertEqual(rows[0][0], 3)
self.assertEqual(rows[0][1], "c")
- # Check correct behavior for empty DataFrame
- pdf = pd.DataFrame({"a": []})
- with self.assertRaises(ValueError):
- self.connect.createDataFrame(pdf)
+ def test_from_empty_pandas_dataframe(self):
+ dfs = [
+ pd.DataFrame(),
+ pd.DataFrame({"a": []}),
+ pd.DataFrame(index=range(5)),
+ ]
+
+ for df in dfs:
+ with self.assertRaises(PySparkValueError) as pe:
+ self.connect.createDataFrame(df)
+ self.check_error(
+ exception=pe.exception,
+ errorClass="CANNOT_INFER_EMPTY_SCHEMA",
+ messageParameters={},
+ )
def test_with_local_ndarray(self):
"""SPARK-41446: Test creating a dataframe using local list"""