This is an automated email from the ASF dual-hosted git repository.
viirya pushed a commit to branch branch-4.2
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-4.2 by this push:
new f9a3a9bf0318 [SPARK-55754][PYTHON][TEST][FOLLOWUP] Fix pure_ints type mismatch in bench
f9a3a9bf0318 is described below
commit f9a3a9bf0318a21dfee2e3c6ed8a6e43739a273d
Author: Liang-Chi Hsieh <[email protected]>
AuthorDate: Wed May 27 23:52:57 2026 -0700
[SPARK-55754][PYTHON][TEST][FOLLOWUP] Fix pure_ints type mismatch in bench
### What changes were proposed in this pull request?
Refactor `MockDataFactory.NAMED_TYPE_POOLS` in
`python/benchmarks/bench_eval_type.py` so the `pure_ints`, `pure_floats`, and
`pure_strings` entries reuse the corresponding `TYPE_REGISTRY` entries instead
of duplicating their factory lambdas.
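
As a rough illustration of the reuse pattern (not the actual benchmark code), the sketch below uses plain-Python stand-ins. The factories and the string type labels are hypothetical; only the names `TYPE_REGISTRY` and `NAMED_TYPE_POOLS` and the pool keys come from this change.

```python
from typing import Any, Callable

# Hypothetical stand-ins: the real registry holds pyarrow factories and Spark types.
TYPE_REGISTRY: dict[str, tuple[Callable[[int], list], Any]] = {
    "int": (lambda r: list(range(r)), "IntegerType"),
    "double": (lambda r: [float(i) for i in range(r)], "DoubleType"),
    "string": (lambda r: [f"s{j}" for j in range(r)], "StringType"),
}

# Pools reference the registry entries instead of re-declaring the lambdas,
# so a factory and its declared type cannot drift apart between the two tables.
NAMED_TYPE_POOLS: dict[str, list[tuple[Callable[[int], list], Any]]] = {
    "pure_ints": [TYPE_REGISTRY["int"]],
    "pure_floats": [TYPE_REGISTRY["double"]],
    "pure_strings": [TYPE_REGISTRY["string"]],
}

factory, declared = NAMED_TYPE_POOLS["pure_ints"][0]
sample = factory(3)
```

Because each pool entry is the same tuple object as the registry entry, a later edit to the registry is picked up by the pool automatically.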
### Why are the changes needed?
`NAMED_TYPE_POOLS["pure_ints"]` declared the column as `IntegerType()`
(32-bit) but generated data with `np.int64`. Because every benchmark that uses
this pool runs through serializers with `arrow_cast=True`, the mismatch was
silently corrected by a 64-to-32 narrowing cast inside the pandas/arrow
conversion path -- meaning the `pure_ints` scenario in seven mixins
(`ArrowBatchedUDF`, `ArrowUDTF`, `ArrowTableUDF`, `MapArrowIterUDF`,
`MapPandasIterUDF`, `ScalarArrowUDF`, `ScalarPandasU [...]
`pure_floats` and `pure_strings` had no such mismatch but duplicated the
same lambdas as `TYPE_REGISTRY["double"]` / `TYPE_REGISTRY["string"]`,
risking drift in future edits. Reusing the registry entries eliminates the
duplication. `pure_ts` is left as-is because no matching `TYPE_REGISTRY` entry
exists.
### Does this PR introduce _any_ user-facing change?
No. Test-only change in the benchmark module.
### How was this patch tested?
- Confirmed `NAMED_TYPE_POOLS["pure_ints"][0]` now produces a
`pa.int32()` array matching its `IntegerType()` declaration (was `pa.int64()`).
- Confirmed `pure_floats` and `pure_strings` still produce `pa.float64()`
and `pa.string()` arrays after the refactor.
- Ran `setup` + `time_worker` for the `pure_ints` scenario across all seven
affected `*TimeBench` classes; all passed.
### Was this patch authored or co-authored using generative AI tooling?
Yes. Generated-by: Claude Code (claude-opus-4-7)
Closes #56169 from viirya/SPARK-55724-pure-ints-followup.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit fc5abd63c107e41a145239c28b3524176b94013f)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
---
python/benchmarks/bench_eval_type.py | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/python/benchmarks/bench_eval_type.py b/python/benchmarks/bench_eval_type.py
index 845b54021775..af6189a4560e 100644
--- a/python/benchmarks/bench_eval_type.py
+++ b/python/benchmarks/bench_eval_type.py
@@ -200,11 +200,9 @@ class MockDataFactory:
NAMED_TYPE_POOLS: dict[str, list[tuple[Callable, Any]]] = {
"mixed": MIXED_TYPES,
- "pure_ints": [
- (lambda r: pa.array(np.random.randint(0, 1000, r, dtype=np.int64)), IntegerType())
- ],
- "pure_floats": [(lambda r: pa.array(np.random.rand(r)), DoubleType())],
- "pure_strings": [(lambda r: pa.array([f"s{j}" for j in range(r)]), StringType())],
+ "pure_ints": [TYPE_REGISTRY["int"]],
+ "pure_floats": [TYPE_REGISTRY["double"]],
+ "pure_strings": [TYPE_REGISTRY["string"]],
"pure_ts": [
(
lambda r: pa.array(