This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6d1815eceea2 [SPARK-49718][PS] Switch `Scatter` plot to sampled data
6d1815eceea2 is described below

commit 6d1815eceea2003de2e3602f0f64e8188e8288d8
Author: Ruifeng Zheng <ruife...@apache.org>
AuthorDate: Thu Sep 19 12:31:48 2024 -0700

    [SPARK-49718][PS] Switch `Scatter` plot to sampled data

    ### What changes were proposed in this pull request?
    Switch the `Scatter` plot from the first n rows to sampled data.

    ### Why are the changes needed?
    When the data distribution is correlated with the row order, the first n rows are not representative of the whole dataset.

    For example:
    ```
    import pandas as pd
    import numpy as np
    import pyspark.pandas as ps

    # ps.set_option("plotting.max_rows", 10000)
    np.random.seed(123)

    pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD')).sort_values("A")
    psdf = ps.DataFrame(pdf)

    psdf.plot.scatter(x='B', y='A')
    ```

    all 10k datapoints: [image]

    before (first 1k datapoints): [image]

    after (sampled 1k datapoints): [image]

    ### Does this PR introduce _any_ user-facing change?
    Yes.

    ### How was this patch tested?
    CI and manual testing.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #48164 from zhengruifeng/ps_scatter_sampling.

    Authored-by: Ruifeng Zheng <ruife...@apache.org>
    Signed-off-by: Dongjoon Hyun <dongj...@apache.org>
---
 python/pyspark/pandas/plot/core.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/plot/core.py b/python/pyspark/pandas/plot/core.py
index 429e97ecf07b..6f036b766924 100644
--- a/python/pyspark/pandas/plot/core.py
+++ b/python/pyspark/pandas/plot/core.py
@@ -479,7 +479,7 @@ class PandasOnSparkPlotAccessor(PandasObject):
         "pie": TopNPlotBase().get_top_n,
         "bar": TopNPlotBase().get_top_n,
         "barh": TopNPlotBase().get_top_n,
-        "scatter": TopNPlotBase().get_top_n,
+        "scatter": SampledPlotBase().get_sampled,
         "area": SampledPlotBase().get_sampled,
         "line": SampledPlotBase().get_sampled,
     }
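
The motivation in the commit message (the first n rows of order-dependent data are not representative) can be illustrated without Spark. The following is a minimal sketch in plain Python, not the pandas-on-Spark implementation: the helpers `first_n` and `sampled` are hypothetical stand-ins for the top-n and sampling reductions, and the uniform sampling without replacement used here is an assumption, not a description of how `SampledPlotBase` samples.

```python
import random
import statistics


def first_n(rows, n):
    """Hypothetical top-n style reduction: keep only the first n rows."""
    return rows[:n]


def sampled(rows, n, seed=0):
    """Hypothetical sampling reduction: n rows drawn uniformly without replacement."""
    return random.Random(seed).sample(rows, n)


rng = random.Random(123)

# 10,000 (A, B) points sorted by A, mirroring sort_values("A") in the commit message.
rows = sorted(
    ((rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10_000)),
    key=lambda row: row[0],
)

head = first_n(rows, 1000)
samp = sampled(rows, 1000)

full_mean_a = statistics.fmean(a for a, _ in rows)
head_mean_a = statistics.fmean(a for a, _ in head)
samp_mean_a = statistics.fmean(a for a, _ in samp)

# The first 1,000 rows cover only the smallest A values, so their mean is far
# below the full-data mean; the sampled rows track the full data closely.
print(f"mean of A, all rows:     {full_mean_a:.3f}")
print(f"mean of A, first 1,000:  {head_mean_a:.3f}")
print(f"mean of A, sampled 1,000: {samp_mean_a:.3f}")
```

Because the rows are sorted by `A`, the first 1,000 rows are exactly the lowest tenth of the `A` values, which is why a scatter plot built from them shows only one edge of the distribution.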