Repository: spark
Updated Branches:
refs/heads/master e2efe0529 -> 0f576a574
[SPARK-15244] [PYTHON] Type of column name created with createDataFrame is not consistent.
## What changes were proposed in this pull request?
**createDataFrame** returns inconsistent types for column names.
```python
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField(u"col", StringType())])
>>> df1 = spark.createDataFrame([("a",)], schema)
>>> df1.columns # "col" is str
['col']
>>> df2 = spark.createDataFrame([("a",)], [u"col"])
>>> df2.columns # "col" is unicode
[u'col']
```
The reason is that only **StructField** contains the following code, which encodes any name that is not already a `str` as UTF-8.
```
if not isinstance(name, str):
    name = name.encode('utf-8')
```
This PR adds the same logic into **createDataFrame** for consistency.
```
if isinstance(schema, list):
    schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
```
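To try the normalization without a Spark session, here is a minimal standalone sketch. The helper name `normalize_column_names` is hypothetical and is not part of PySpark; it only wraps the same list comprehension. On Python 2 a `unicode` name is encoded to `str`; on Python 3 every text name is already `str`, so the list comes back unchanged.

```python
def normalize_column_names(schema):
    """Hypothetical helper mirroring the check this PR adds to createDataFrame."""
    if isinstance(schema, list):
        # Encode any name that is not already `str` (i.e. `unicode` on Python 2).
        # On Python 3 every text name is already `str`, so nothing is encoded.
        schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
    # Anything that is not a list (e.g. a StructType) is returned untouched.
    return schema


names = normalize_column_names(['name', u'age'])
print(names)
print([type(n).__name__ for n in names])
```

Either way, every element of the returned list is a `str`, which is the property the new test asserts on `df.columns`.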
## How was this patch tested?
Pass the Jenkins tests (with a new Python test, `test_column_name_encoding`, in `python/pyspark/sql/tests.py`).
Author: Dongjoon Hyun <[email protected]>
Closes #13097 from dongjoon-hyun/SPARK-15244.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f576a57
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f576a57
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f576a57
Branch: refs/heads/master
Commit: 0f576a5748244f7e874b925f8d841f1ca238f087
Parents: e2efe05
Author: Dongjoon Hyun <[email protected]>
Authored: Tue May 17 13:05:07 2016 -0700
Committer: Davies Liu <[email protected]>
Committed: Tue May 17 13:05:07 2016 -0700
----------------------------------------------------------------------
python/pyspark/sql/session.py | 2 ++
python/pyspark/sql/tests.py | 7 +++++++
2 files changed, 9 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/0f576a57/python/pyspark/sql/session.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index ae31435..0781b44 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -465,6 +465,8 @@ class SparkSession(object):
                 return (obj, )
             schema = StructType().add("value", datatype)
         else:
+            if isinstance(schema, list):
+                schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
             prepare = lambda obj: obj
         if isinstance(data, RDD):
http://git-wip-us.apache.org/repos/asf/spark/blob/0f576a57/python/pyspark/sql/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 0c73f58..0977c43 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -228,6 +228,13 @@ class SQLTests(ReusedPySparkTestCase):
         self.assertRaises(AnalysisException, lambda: df.select(df.c).first())
         self.assertRaises(AnalysisException, lambda: df.select(df["c"]).first())
+    def test_column_name_encoding(self):
+        """Ensure that created columns has `str` type consistently."""
+        columns = self.spark.createDataFrame([('Alice', 1)], ['name', u'age']).columns
+        self.assertEqual(columns, ['name', 'age'])
+        self.assertTrue(isinstance(columns[0], str))
+        self.assertTrue(isinstance(columns[1], str))
+
     def test_explode(self):
         from pyspark.sql.functions import explode
         d = [Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})]