This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 9fae65bb220e [SPARK-53959][CONNECT] Throw a client-side error when
creating a dataframe from a pandas dataframe with an index but no data
9fae65bb220e is described below
commit 9fae65bb220e77f5d3de633a84ae2ed6af270f1d
Author: Alex Khakhlyuk <[email protected]>
AuthorDate: Fri Oct 31 13:18:57 2025 +0900
[SPARK-53959][CONNECT] Throw a client-side error when creating a dataframe
from a pandas dataframe with an index but no data
### What changes were proposed in this pull request?
The Spark Connect Python client does not throw a proper error when creating a
dataframe from a pandas dataframe with an index and empty data.
Generally, the Spark Connect client throws a client-side error
`[CANNOT_INFER_EMPTY_SCHEMA] Can not infer schema from an empty dataset` when
creating a dataframe without data, for example via
```
spark.createDataFrame([]).show()
```
or
```
df = pd.DataFrame()
spark.createDataFrame(df).show()
```
or
```
df = pd.DataFrame({"a": []})
spark.createDataFrame(df).show()
```
This does not happen when pandas dataframe has an index but no data, e.g.
```
df = pd.DataFrame(index=range(5))
spark.createDataFrame(df).show()
```
What happens instead is that the dataframe is converted to a LocalRelation on
the client and sent to the server, but the server then throws the following
exception: `INTERNAL_ERROR: Input data for LocalRelation does not produce a
schema. SQLSTATE: XX000`. XX000 is an internal-error SQLSTATE, and the error is
not actionable enough for the user.
This PR fixes this problem by throwing `CANNOT_INFER_EMPTY_SCHEMA` when the
dataframe has rows (because of the index) but no columns.
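The failing case can be reproduced with plain pandas. A minimal sketch (not the actual Spark source) of the condition the new client-side check targets: a pandas dataframe built from an index alone reports rows but exposes no columns, so there is nothing to infer a schema from.

```python
import pandas as pd

# A dataframe constructed only from an index: it has rows but no columns.
df = pd.DataFrame(index=range(5))

print(len(df))          # rows contributed by the index: 5
print(len(df.columns))  # columns available for schema inference: 0
```

With zero columns, the client can now reject the input up front instead of shipping a column-less LocalRelation to the server.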
### Why are the changes needed?
Currently the error is thrown as a server-side internal error and the error
message is not actionable enough for the user.
### Does this PR introduce _any_ user-facing change?
Creating a Spark Connect dataframe from a pandas dataframe with an index
but no data will now throw a client-side error `[CANNOT_INFER_EMPTY_SCHEMA] Can
not infer schema from an empty dataset` instead of the server-side
`INTERNAL_ERROR: Input data for LocalRelation does not produce a schema.
SQLSTATE: XX000`.
### How was this patch tested?
New unit test.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #52670 from khakhlyuk/pd-empty-data.
Authored-by: Alex Khakhlyuk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/sql/connect/session.py | 5 +++++
.../sql/tests/connect/test_connect_creation.py | 19 +++++++++++++++----
2 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py
index 2a678c95c925..21a7c8329a35 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -559,6 +559,11 @@ class SparkSession:
# If no schema supplied by user then get the names of columns only
if schema is None:
_cols = [str(x) if not isinstance(x, str) else x for x in data.columns]
+ if len(_cols) == 0:
+ raise PySparkValueError(
+ errorClass="CANNOT_INFER_EMPTY_SCHEMA",
+ messageParameters={},
+ )
infer_pandas_dict_as_map = (
configs["spark.sql.execution.pandas.inferPandasDictAsMap"]
== "true"
)
diff --git a/python/pyspark/sql/tests/connect/test_connect_creation.py b/python/pyspark/sql/tests/connect/test_connect_creation.py
index 917320d354e2..7be9959fdcb4 100644
--- a/python/pyspark/sql/tests/connect/test_connect_creation.py
+++ b/python/pyspark/sql/tests/connect/test_connect_creation.py
@@ -54,10 +54,21 @@ class SparkConnectCreationTests(ReusedMixedTestCase, PandasOnSparkTestUtils):
self.assertEqual(rows[0][0], 3)
self.assertEqual(rows[0][1], "c")
- # Check correct behavior for empty DataFrame
- pdf = pd.DataFrame({"a": []})
- with self.assertRaises(ValueError):
- self.connect.createDataFrame(pdf)
+ def test_from_empty_pandas_dataframe(self):
+ dfs = [
+ pd.DataFrame(),
+ pd.DataFrame({"a": []}),
+ pd.DataFrame(index=range(5)),
+ ]
+
+ for df in dfs:
+ with self.assertRaises(PySparkValueError) as pe:
+ self.connect.createDataFrame(df)
+ self.check_error(
+ exception=pe.exception,
+ errorClass="CANNOT_INFER_EMPTY_SCHEMA",
+ messageParameters={},
+ )
def test_with_local_ndarray(self):
"""SPARK-41446: Test creating a dataframe using local list"""