[PR] Add configurable UTF8 payload profiles to string_agg benchmarks and parameterize benchmark matrix [datafusion]

via GitHub Tue, 07 Apr 2026 05:04:31 -0700


kosiew opened a new pull request, #21437:
URL: https://github.com/apache/datafusion/pull/21437


   ## Which issue does this PR close?
   
   * Part of #21156
   
   ---
   
   ## Rationale for this change
   
   The current `string_agg` benchmarks use very small (≈3-byte) string payloads 
(e.g., `hi0`..`hi3`). This makes it difficult to observe the real cost of 
string aggregation in scenarios where payload sizes are larger, particularly 
the cost of copying and memory pressure during grouped aggregation.
   
   This PR introduces configurable UTF-8 payload sizes so benchmarks can better 
reflect realistic workloads and expose CPU and memory behavior differences 
across payload sizes.
   
   ---
   
   ## What changes are included in this PR?
   
   * Introduced a new `Utf8PayloadProfile` enum with three profiles:
   
     * `Small` (≈3 bytes, existing baseline)
     * `Medium` (≈64 bytes)
     * `Large` (≈1024 bytes)
   
   * Added `create_table_provider_with_payload` to allow generating tables with 
configurable string payload sizes.
   
   * Refactored record batch generation to use precomputed payload arrays 
instead of formatting strings per row.
   
   * Added helpers:
   
     * `payload_string` for generating fixed-size string payloads
     * `Utf8PayloadProfile::payloads()` for producing 4-value low-cardinality 
payload sets
   
   * Updated benchmark setup:
   
     * Introduced `create_context_with_payload`
     * Parameterized `string_agg` queries across:
   
       * group cardinality (`few`, `mid`, `many`)
       * payload size (`small_3b`, `medium_64b`, `large_1024b`)
   
   * Replaced individual benchmark functions with a `criterion` benchmark group 
using `BenchmarkId` to produce a matrix of results.
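
A minimal sketch of how the payload helpers listed above might fit together, using only the standard library. The names `Utf8PayloadProfile`, `payload_string`, and `payloads()` come from this description. The signatures, the `target_len` method, the `hi{i}` seed, and the `_` padding character are assumptions made for illustration and may differ from the actual diff.

```rust
/// Payload size profiles named in the PR description.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Utf8PayloadProfile {
    Small,
    Medium,
    Large,
}

impl Utf8PayloadProfile {
    /// Target payload length in bytes (hypothetical helper; sizes taken
    /// from the approximate figures in the PR description).
    pub fn target_len(self) -> usize {
        match self {
            Utf8PayloadProfile::Small => 3,
            Utf8PayloadProfile::Medium => 64,
            Utf8PayloadProfile::Large => 1024,
        }
    }

    /// Builds four distinct payloads once, so that batch generation can
    /// index into them instead of formatting a new string per row.
    pub fn payloads(self) -> [String; 4] {
        std::array::from_fn(|i| payload_string(&format!("hi{i}"), self.target_len()))
    }
}

/// Pads `seed` with ASCII filler until it is at least `len` bytes long.
/// The signature and the padding character are assumptions.
pub fn payload_string(seed: &str, len: usize) -> String {
    let mut s = String::with_capacity(len.max(seed.len()));
    s.push_str(seed);
    while s.len() < len {
        s.push('_');
    }
    s
}
```

Under these assumptions the `Small` profile reproduces the existing `hi0`..`hi3` baseline, while `Medium` and `Large` keep the same four-value cardinality at larger sizes.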
   
   ---
   
   ## Are these changes tested?
   
   No new unit tests were added.
   
   Reason:
   
   * This change only affects benchmark utilities and benchmark definitions.
   * Existing functionality remains unchanged for the default (`Small`) payload 
profile.
   * Benchmarks themselves act as validation for correctness and performance 
behavior.
   
   ---
   
   ## Are there any user-facing changes?
   
   No.
   
   This PR only impacts internal benchmarking infrastructure and does not 
modify public APIs or query behavior.
   
   ---
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   
   ---
   
   ## Additional Notes
   
   * The benchmark matrix now isolates the impact of payload size while 
preserving low cardinality (4 distinct values), ensuring that observed 
differences are primarily due to string size rather than grouping distribution.
   * Medium payloads aim to expose copy costs without excessive allocator 
overhead, while large payloads stress both CPU and memory behavior.
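
The benchmark matrix can be pictured as the cross product of the group-cardinality and payload-size labels listed earlier. The sketch below enumerates that product with the standard library only. The `group/payload` id format and the `benchmark_ids` function are hypothetical; the PR itself builds its ids with `criterion`'s `BenchmarkId`, whose exact arguments are not shown in this description.

```rust
/// Group-cardinality labels named in the PR description.
const GROUP_LABELS: [&str; 3] = ["few", "mid", "many"];

/// Payload-size labels named in the PR description.
const PAYLOAD_LABELS: [&str; 3] = ["small_3b", "medium_64b", "large_1024b"];

/// Returns one id per (cardinality, payload) pair, cardinality-major.
/// The "group/payload" format is an assumption for illustration.
pub fn benchmark_ids() -> Vec<String> {
    GROUP_LABELS
        .iter()
        .flat_map(|g| PAYLOAD_LABELS.iter().map(move |p| format!("{g}/{p}")))
        .collect()
}
```

Holding the payload set at four distinct values in every cell means that two cells in the same row differ only in string size, which is the isolation the notes above describe.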
   
   ---
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

