This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 6d1815eceea2 [SPARK-49718][PS] Switch `Scatter` plot to sampled data
6d1815eceea2 is described below

commit 6d1815eceea2003de2e3602f0f64e8188e8288d8
Author: Ruifeng Zheng <ruife...@apache.org>
AuthorDate: Thu Sep 19 12:31:48 2024 -0700

    [SPARK-49718][PS] Switch `Scatter` plot to sampled data
    
    ### What changes were proposed in this pull request?
    Switch `Scatter` plot to sampled data
    
    ### Why are the changes needed?
    When the data distribution is correlated with the row order, the first n rows are not representative of the whole dataset.
    
    For example:
    ```
    import pandas as pd
    import numpy as np
    import pyspark.pandas as ps
    
    # ps.set_option("plotting.max_rows", 10000)
    np.random.seed(123)
    
    pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD')).sort_values("A")
    psdf = ps.DataFrame(pdf)
    
    psdf.plot.scatter(x='B', y='A')
    ```
    
    all 10k datapoints:
    
![image](https://github.com/user-attachments/assets/72cf7e97-ad10-41e0-a8a6-351747d5285f)
    
    before (first 1k datapoints):
    
![image](https://github.com/user-attachments/assets/1ed50d2c-7772-4579-a84c-6062542d9367)
    
    after (sampled 1k datapoints):
    
![image](https://github.com/user-attachments/assets/6c684cba-4119-4c38-8228-2bedcdeb9e59)
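
The before/after difference can be approximated in plain pandas, without a Spark session. This is only a minimal sketch: `max_rows` is an assumed limit chosen to match the "1k datapoints" above, and `head` and `sample` are stand-ins for the top-n and sampled code paths, whose internals are not shown in this commit.

```python
import numpy as np
import pandas as pd

np.random.seed(123)

# Same data as the example above: 10,000 rows sorted by column "A".
pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list("ABCD")).sort_values("A")

# Assumed row limit, matching the "1k datapoints" in the screenshots.
max_rows = 1000

# Stand-in for the old behavior: take the first n rows in stored order.
first_n = pdf.head(max_rows)

# Stand-in for the new behavior: draw n rows at random.
sampled = pdf.sample(n=max_rows, random_state=123)

# Because the frame is sorted by "A", the first n rows only cover the low tail,
# while the random sample spans roughly the same range as the full data.
for name, frame in [("full", pdf), ("first n", first_n), ("sampled", sampled)]:
    print(f"{name:8s} A in [{frame['A'].min():.2f}, {frame['A'].max():.2f}]")
```

The first n rows contain only the smallest values of `A`, which is why the "before" plot shows a thin band instead of the full cloud.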
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, scatter plots now draw sampled rows instead of the first n rows.
    
    ### How was this patch tested?
    CI and manual testing.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes #48164 from zhengruifeng/ps_scatter_sampling.
    
    Authored-by: Ruifeng Zheng <ruife...@apache.org>
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
---
 python/pyspark/pandas/plot/core.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/plot/core.py b/python/pyspark/pandas/plot/core.py
index 429e97ecf07b..6f036b766924 100644
--- a/python/pyspark/pandas/plot/core.py
+++ b/python/pyspark/pandas/plot/core.py
@@ -479,7 +479,7 @@ class PandasOnSparkPlotAccessor(PandasObject):
         "pie": TopNPlotBase().get_top_n,
         "bar": TopNPlotBase().get_top_n,
         "barh": TopNPlotBase().get_top_n,
-        "scatter": TopNPlotBase().get_top_n,
+        "scatter": SampledPlotBase().get_sampled,
         "area": SampledPlotBase().get_sampled,
         "line": SampledPlotBase().get_sampled,
     }


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
