This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 8ac083c08ef8 [SPARK-55674][PYTHON] Optimize 0-column table conversion in Spark Connect
8ac083c08ef8 is described below

commit 8ac083c08ef8ca4b1d7d1baa84b20fbc119adbeb
Author: Yicong-Huang <[email protected]>
AuthorDate: Wed Feb 25 15:56:27 2026 +0900

    [SPARK-55674][PYTHON] Optimize 0-column table conversion in Spark Connect
    
    ### What changes were proposed in this pull request?
    
    Replace `pa.Table.from_struct_array(pa.array([{}] * len(data), type=pa.struct([])))` with `pa.Table.from_batches([pa.RecordBatch.from_pandas(data)])` in `connect/session.py` when handling 0-column pandas DataFrames. The new call is an O(1) operation, regardless of how many rows there are.
    
    ### Why are the changes needed?
    
    The original approach constructs `len(data)` Python dict objects (`[{}] * len(data)`), which is O(n). `pa.RecordBatch.from_pandas` is an O(1) operation regardless of the number of rows, as it reads the row count directly from the pandas index metadata without allocating per-row Python objects.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #54468 from Yicong-Huang/SPARK-55674/followup/unify-zero-column-pandas-arrow-fix.
    
    Authored-by: Yicong-Huang <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 python/pyspark/sql/connect/session.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py
index 384d10c2ae58..f9a360ec6054 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -622,8 +622,9 @@ class SparkSession:
             safecheck = configs["spark.sql.execution.pandas.convertToArrowArraySafely"]
 
             # Handle the 0-column case separately to preserve row count.
+            # pa.RecordBatch.from_pandas preserves num_rows via pandas index metadata.
             if len(data.columns) == 0:
-                _table = pa.Table.from_struct_array(pa.array([{}] * len(data), type=pa.struct([])))
+                _table = pa.Table.from_batches([pa.RecordBatch.from_pandas(data)])
             else:
                 _table = pa.Table.from_batches(
                     [


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
