kosiew opened a new pull request, #21437:
URL: https://github.com/apache/datafusion/pull/21437
## Which issue does this PR close?
* Part of #21156
---
## Rationale for this change
The current `string_agg` benchmarks use very small (≈3-byte) string payloads
(e.g., `hi0`..`hi3`). This makes it difficult to observe the real cost of
string aggregation in scenarios where payload sizes are larger, particularly
the cost of copying and memory pressure during grouped aggregation.
This PR introduces configurable UTF-8 payload sizes so benchmarks can better
reflect realistic workloads and expose CPU and memory behavior differences
across payload sizes.
---
## What changes are included in this PR?
* Introduced a new `Utf8PayloadProfile` enum with three profiles:
  * `Small` (≈3 bytes, existing baseline)
  * `Medium` (≈64 bytes)
  * `Large` (≈1024 bytes)
* Added `create_table_provider_with_payload` to allow generating tables with
configurable string payload sizes.
* Refactored record batch generation to use precomputed payload arrays
instead of formatting strings per row.
* Added helpers:
  * `payload_string` for generating fixed-size string payloads
  * `Utf8PayloadProfile::payloads()` for producing 4-value low-cardinality
    payload sets
* Updated benchmark setup:
  * Introduced `create_context_with_payload`
  * Parameterized `string_agg` queries across:
    * group cardinality (`few`, `mid`, `many`)
    * payload size (`small_3b`, `medium_64b`, `large_1024b`)
* Replaced individual benchmark functions with a `criterion` benchmark group
  using `BenchmarkId` to produce a matrix of results.
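
For readers who have not opened the diff, here is a minimal sketch of how the payload helpers listed above could fit together. It is an illustration only, not the code in this PR: the `target_len` method, the `prefix`/`index`/`target_len` parameters of `payload_string`, and the `x` filler character are assumptions; only the enum name, the variant names, the approximate sizes, and the 4-value payload set come from the description.

```rust
/// Hypothetical sketch of the payload profiles; the real enum may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Utf8PayloadProfile {
    Small,
    Medium,
    Large,
}

impl Utf8PayloadProfile {
    /// Target payload length in bytes (assumed from the sizes listed above).
    pub fn target_len(self) -> usize {
        match self {
            Utf8PayloadProfile::Small => 3,
            Utf8PayloadProfile::Medium => 64,
            Utf8PayloadProfile::Large => 1024,
        }
    }

    /// Four distinct payloads, built once so that row generation can reuse
    /// them instead of formatting a new string for every row.
    pub fn payloads(self) -> [String; 4] {
        std::array::from_fn(|i| payload_string("hi", i, self.target_len()))
    }
}

/// Builds `<prefix><index>` and pads it with ASCII filler up to `target_len`
/// bytes. The signature is an assumption made for this sketch.
pub fn payload_string(prefix: &str, index: usize, target_len: usize) -> String {
    let mut s = format!("{prefix}{index}");
    // ASCII filler keeps the byte length equal to the character count.
    while s.len() < target_len {
        s.push('x');
    }
    s
}
```

Under these assumptions, `Utf8PayloadProfile::Small.payloads()` reproduces the existing `hi0`..`hi3` baseline, while `Medium` and `Large` keep the same four distinct prefixes and only grow the padding.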
---
## Are these changes tested?
No new unit tests were added.
Reason:
* This change only affects benchmark utilities and benchmark definitions.
* Existing functionality remains unchanged for the default (`Small`) payload
profile.
* Benchmarks themselves act as validation for correctness and performance
behavior.
---
## Are there any user-facing changes?
No.
This PR only impacts internal benchmarking infrastructure and does not
modify public APIs or query behavior.
---
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
---
## Additional Notes
* The benchmark matrix now isolates the impact of payload size while
preserving low cardinality (4 distinct values), ensuring that observed
differences are primarily due to string size rather than grouping distribution.
* Medium payloads aim to expose copy costs without excessive allocator
overhead, while large payloads stress both CPU and memory behavior.
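
As a rough illustration of the first note, the sketch below enumerates the 3 × 3 benchmark matrix and shows that cycling through four precomputed payloads keeps the distinct-value count at 4 while the total bytes scale with payload size. It is written against the standard library only and is not taken from the PR: `matrix_labels`, `payload_column`, `column_stats`, and the `cardinality/payload` label format are hypothetical names chosen for this example, and the real benchmarks build Arrow record batches rather than `Vec<String>`.

```rust
use std::collections::HashSet;

/// Labels taken from the parameter names listed in the PR description.
const CARDINALITIES: [&str; 3] = ["few", "mid", "many"];
const PAYLOADS: [(&str, usize); 3] =
    [("small_3b", 3), ("medium_64b", 64), ("large_1024b", 1024)];

/// Enumerates one label per (cardinality, payload) pair.
/// The `card/payload` format is an assumption for this sketch.
pub fn matrix_labels() -> Vec<String> {
    let mut labels = Vec::new();
    for card in CARDINALITIES {
        for (payload, _) in PAYLOADS {
            labels.push(format!("{card}/{payload}"));
        }
    }
    labels
}

/// Fills a column of `rows` values by cycling through four precomputed
/// payloads. Hypothetical stand-in for the record batch generation.
pub fn payload_column(payload_len: usize, rows: usize) -> Vec<String> {
    let pool: Vec<String> = (0..4)
        .map(|i| {
            let mut s = format!("hi{i}");
            // Pad with ASCII filler up to the requested byte length.
            s.push_str(&"x".repeat(payload_len.saturating_sub(s.len())));
            s
        })
        .collect();
    (0..rows).map(|r| pool[r % pool.len()].clone()).collect()
}

/// Returns (number of distinct values, total payload bytes) for a column.
pub fn column_stats(column: &[String]) -> (usize, usize) {
    let distinct: HashSet<&str> = column.iter().map(String::as_str).collect();
    let bytes: usize = column.iter().map(String::len).sum();
    (distinct.len(), bytes)
}
```

With 1,000 rows, `column_stats` reports 4 distinct values for every payload size, while the byte total grows from 3,000 to 64,000 to 1,024,000, which is the variable the matrix is meant to isolate.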
---