This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 29fc55980059 [SPARK-55296][PS][FOLLOW-UP] Fix CoW mode not to break 
groupby
29fc55980059 is described below

commit 29fc5598005903e1e99a46f6065d2d2ed6b7285a
Author: Takuya Ueshin <[email protected]>
AuthorDate: Fri Feb 20 13:26:37 2026 +0900

    [SPARK-55296][PS][FOLLOW-UP] Fix CoW mode not to break groupby
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up of apache/spark#54375.
    
    Fixes CoW mode not to break `groupby`.
    Delays to disconnect the anchor to when actually being updated.
    
    ### Why are the changes needed?
    
    The CoW mode was supported at apache/spark#54375, but it disconnected the 
anchor too early, causing to break `groupby`.
    
    ```py
    >>> import pandas as pd
    >>> import pyspark.pandas as ps
    >>>
    >>> pdf1 = pd.DataFrame({"C": [0.362, 0.227, 1.267, -0.562], "B": [1, 2, 3, 
4]})
    >>> pdf2 = pd.DataFrame({"A": [1, 1, 2, 2]})
    >>>
    >>> psdf1 = ps.from_pandas(pdf1)
    >>> psdf2 = ps.from_pandas(pdf2)
    >>>
    >>> pdf1.groupby([pdf1.C, pdf2.A]).agg("sum").sort_index()
              B
    C      A
    -0.562 2  4
     0.227 1  2
     0.362 1  1
     1.267 2  3
    >>> psdf1.groupby([psdf1.C, psdf2.A]).agg("sum").sort_index()
                  C  B
    C      A
    -0.562 2 -0.562  4
     0.227 1  0.227  2
     0.362 1  0.362  1
     1.267 2  1.267  3
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it will behave more like pandas 3.
    
    ### How was this patch tested?
    
    The existing tests should pass.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Codex (GPT-5.3-Codex)
    
    Closes #54392 from ueshin/issues/SPARK-55296/fix_groupby.
    
    Authored-by: Takuya Ueshin <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 python/pyspark/pandas/indexing.py | 16 +++++++++++++++-
 python/pyspark/pandas/series.py   |  5 +----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/pandas/indexing.py 
b/python/pyspark/pandas/indexing.py
index fea5e3f55a56..f5f42b6fda89 100644
--- a/python/pyspark/pandas/indexing.py
+++ b/python/pyspark/pandas/indexing.py
@@ -587,6 +587,16 @@ class LocIndexerLike(IndexerLike, metaclass=ABCMeta):
         from pyspark.pandas.series import Series, first_series
 
         if self._is_series:
+            if LooseVersion(pd.__version__) >= "3.0.0":
+                # pandas 3 CoW: mutating a Series view should not mutate the 
parent DataFrame.
+                self._psdf_or_psser._update_anchor(
+                    DataFrame(
+                        self._psdf_or_psser._psdf._internal.select_column(
+                            self._psdf_or_psser._column_label
+                        )
+                    )
+                )
+
             if (
                 isinstance(key, Series)
                 and (isinstance(self, iLocIndexer) or not same_anchor(key, 
self._psdf_or_psser))
@@ -811,7 +821,11 @@ class LocIndexerLike(IndexerLike, metaclass=ABCMeta):
             internal = self._internal.with_new_columns(
                 new_data_spark_columns, column_labels=column_labels, 
data_fields=new_fields
             )
-            self._psdf_or_psser._update_internal_frame(internal, 
check_same_anchor=False)
+            self._psdf_or_psser._update_internal_frame(
+                internal,
+                check_same_anchor=False,
+                anchor_force_disconnect=LooseVersion(pd.__version__) >= 
"3.0.0",
+            )
 
 
 class LocIndexer(LocIndexerLike):
diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 882d44880b47..72d49574423b 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -430,10 +430,7 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
             assert not copy
             assert fastpath is no_default
 
-            if LooseVersion(pd.__version__) < "3.0.0":
-                self._anchor = data
-            else:
-                self._anchor = DataFrame(data)
+            self._anchor = data
             self._col_label = index
 
         elif isinstance(data, Series):


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to