This is an automated email from the ASF dual-hosted git repository.

viirya pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 7da33f344f46 [SPARK-56120][PYTHON][TEST][FOLLOWUP] Make _WindowAggArrowBenchMixin scenarios lazy
7da33f344f46 is described below

commit 7da33f344f46ba9282f4030264b31b173dca4703
Author: Liang-Chi Hsieh <[email protected]>
AuthorDate: Wed May 27 23:42:53 2026 -0700

    [SPARK-56120][PYTHON][TEST][FOLLOWUP] Make _WindowAggArrowBenchMixin scenarios lazy
    
    ### What changes were proposed in this pull request?
    
    Convert `_WindowAggArrowBenchMixin` in `python/benchmarks/bench_eval_type.py`
    to the lazy `_scenario_configs` + `staticmethod _build_scenario(name)` pattern
    used by every other mixin in the file, matching the immediately-following
    `_WindowAggPandasBenchMixin`.
    
    ### Why are the changes needed?
    
    SPARK-56244 follow-up (commit 1c807ade4a4) removed eager
    `_scenarios = _build_scenarios()` from all mixins so that importing the
    benchmark module no longer materializes every scenario's Arrow data -- a
    prerequisite for accurate per-scenario `peakmem_*` readings under ASV (ASV
    reports the max RSS observed in the worker process, so any import-time
    allocation inflates every subsequent peakmem result).
    
    SPARK-56120 (`78aaf11728b`, merged the day after the follow-up) reintroduced
    the eager pattern in `_WindowAggArrowBenchMixin`, leaving it as the only
    mixin in the file still doing class-body data construction. As a result,
    `WindowAggArrowUDFPeakmemBench` readings are dominated by the global
    import-time allocation rather than the per-scenario footprint.
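    
    The difference between the two patterns can be sketched in isolation. This
    is a minimal illustration, not the benchmark code: `make_data`,
    `EagerMixin`, `LazyMixin`, and the tiny scenario sizes are hypothetical
    stand-ins, with `make_data` playing the role of
    `MockDataFactory.make_grouped_batches`.
    
```python
# Hypothetical stand-in for MockDataFactory.make_grouped_batches: it records
# each call and allocates a small nested list instead of Arrow batches.
build_calls = []


def make_data(num_groups, rows_per_group):
    build_calls.append((num_groups, rows_per_group))
    return [[0] * rows_per_group for _ in range(num_groups)]


class EagerMixin:
    # A class body runs at import time, so every scenario is built as soon
    # as the module is imported, whether or not a benchmark ever uses it.
    _scenarios = {
        "small": make_data(2, 3),
        "large": make_data(4, 5),
    }
    params = [list(_scenarios)]


eager_calls = len(build_calls)  # both scenarios were already built


class LazyMixin:
    # Only the cheap configuration tuples exist at import time.
    _scenario_configs = {
        "small": (2, 3),
        "large": (4, 5),
    }
    params = [list(_scenario_configs)]

    @staticmethod
    def _build_scenario(name):
        """Build a single scenario on demand."""
        num_groups, rows_per_group = LazyMixin._scenario_configs[name]
        return make_data(num_groups, rows_per_group)


lazy_calls_at_import = len(build_calls) - eager_calls
small = LazyMixin._build_scenario("small")
lazy_calls_after_use = len(build_calls) - eager_calls
```
    
    In this sketch the eager class triggers two builds before any benchmark
    runs, while the lazy class triggers none until `_build_scenario` is
    called, and then only for the requested scenario. Both classes expose the
    same `params`.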
    
    Measured locally with `tracemalloc`:
    - before: import peak = 394.54 MiB
    - after:  import peak =  29.17 MiB
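    
    One way such a number can be obtained is sketched below. The exact script
    used for the measurement above is not part of this commit, so the helper
    name `import_peak_mib` and the stand-in module `json` are assumptions.
    Note that `tracemalloc` only sees allocations made through Python's memory
    allocators.
    
```python
import importlib
import sys
import tracemalloc


def import_peak_mib(module_name):
    """Return the peak traced allocation, in MiB, while importing a module."""
    # Drop any cached copy so the module body actually runs again.
    sys.modules.pop(module_name, None)
    tracemalloc.start()
    try:
        importlib.import_module(module_name)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)


# Stand-in module for illustration; the commit message refers to
# python.benchmarks.bench_eval_type.
peak = import_peak_mib("json")
print(f"import peak = {peak:.2f} MiB")
```
    
    Running the helper in a fresh interpreter for each measurement avoids
    counting modules that an earlier import already cached.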
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. Test-only change in the benchmark module.
    
    ### How was this patch tested?
    
    - Imported `python.benchmarks.bench_eval_type` and asserted the lazy
      structure is in place (`_scenario_configs` present, `_scenarios` absent,
      `_build_scenario` is a staticmethod).
    - Ran `WindowAggArrowUDFTimeBench.setup` + `time_worker` for
      `(many_groups_sm, few_groups_sm) x (sum_udf, mean_multi_udf)`.
    - Ran `WindowAggArrowUDFPeakmemBench.setup` + `peakmem_worker` for
      `many_groups_sm/sum_udf`.
    - Compared import-time peak memory before/after (numbers above).
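    
    The structural check in the first bullet could look roughly like the
    following. This is a sketch under assumptions: `is_lazy_mixin` and the two
    demo classes are hypothetical, and the real check would be run against
    `_WindowAggArrowBenchMixin` after importing the benchmark module.
    
```python
import inspect


def is_lazy_mixin(cls):
    """Return True if cls follows the lazy scenario pattern."""
    # getattr_static returns the raw class attribute without invoking the
    # descriptor protocol, so a staticmethod is still seen as a staticmethod.
    builder = inspect.getattr_static(cls, "_build_scenario", None)
    return (
        hasattr(cls, "_scenario_configs")
        and not hasattr(cls, "_scenarios")
        and isinstance(builder, staticmethod)
    )


class DemoLazy:
    # Hypothetical mixin that follows the lazy pattern.
    _scenario_configs = {"small": (2, 3)}

    @staticmethod
    def _build_scenario(name):
        return DemoLazy._scenario_configs[name]


class DemoEager:
    # Hypothetical mixin that still holds prebuilt scenario data.
    _scenarios = {"small": [[0, 0, 0], [0, 0, 0]]}


lazy_ok = is_lazy_mixin(DemoLazy)
eager_ok = is_lazy_mixin(DemoEager)
```
    
    Here `lazy_ok` is true and `eager_ok` is false, because `DemoEager` still
    carries a `_scenarios` attribute and has no `_build_scenario`.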
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Yes. Generated-by: Claude Code (claude-opus-4-7)
    
    Closes #56167 from viirya/SPARK-56120-followup.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
---
 python/benchmarks/bench_eval_type.py | 47 ++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 26 deletions(-)

diff --git a/python/benchmarks/bench_eval_type.py b/python/benchmarks/bench_eval_type.py
index c75e4490d1ed..131ced87dfc8 100644
--- a/python/benchmarks/bench_eval_type.py
+++ b/python/benchmarks/bench_eval_type.py
@@ -1795,41 +1795,36 @@ class _WindowAggArrowBenchMixin:
 
         return (pc.mean(col0).as_py() or 0) + (pc.mean(col1).as_py() or 0)
 
-    def _build_scenarios():
-        """Build scenarios for SQL_WINDOW_AGG_ARROW_UDF.
-
-        Returns a dict mapping scenario name to ``(groups, schema)``.
-        """
-        scenarios = {}
-
-        for name, (num_groups, rows_per_group, n_cols) in {
-            "few_groups_sm": (50, 5_000, 5),
-            "few_groups_lg": (50, 50_000, 5),
-            "many_groups_sm": (2_000, 500, 5),
-            "many_groups_lg": (500, 10_000, 5),
-            "wide_cols": (200, 5_000, 20),
-        }.items():
-            groups, schema = MockDataFactory.make_grouped_batches(
-                num_groups=num_groups,
-                num_rows=rows_per_group,
-                num_cols=n_cols,
-                spark_type_pool=MockDataFactory.NUMERIC_TYPES,
-                batch_size=rows_per_group,
-            )
-            scenarios[name] = (groups, schema)
+    _scenario_configs = {
+        "few_groups_sm": (50, 5_000, 5),
+        "few_groups_lg": (50, 50_000, 5),
+        "many_groups_sm": (2_000, 500, 5),
+        "many_groups_lg": (500, 10_000, 5),
+        "wide_cols": (200, 5_000, 20),
+    }
 
-        return scenarios
+    @staticmethod
+    def _build_scenario(name):
+        """Build a single scenario by name."""
+        np.random.seed(42)
+        num_groups, rows_per_group, n_cols = _WindowAggArrowBenchMixin._scenario_configs[name]
+        return MockDataFactory.make_grouped_batches(
+            num_groups=num_groups,
+            num_rows=rows_per_group,
+            num_cols=n_cols,
+            spark_type_pool=MockDataFactory.NUMERIC_TYPES,
+            batch_size=rows_per_group,
+        )
 
-    _scenarios = _build_scenarios()
     _udfs = {
         "sum_udf": _window_agg_arrow_sum,
         "mean_multi_udf": _window_agg_arrow_mean_multi,
     }
-    params = [list(_scenarios), list(_udfs)]
+    params = [list(_scenario_configs), list(_udfs)]
     param_names = ["scenario", "udf"]
 
     def _write_scenario(self, scenario, udf_name, buf):
-        groups, _schema = self._scenarios[scenario]
+        groups, _schema = self._build_scenario(scenario)
         udf_func = self._udfs[udf_name]
 
         # sum_udf uses 1 arg, mean_multi_udf uses 2 args

