kosiew opened a new pull request, #21469:
URL: https://github.com/apache/datafusion/pull/21469

   ## Which issue does this PR close?
   
   * Part of #21156
   
   ---
   
   ## Rationale for this change
   
   The current `StringAggGroupsAccumulator` eagerly copies every input string 
into per-group `String` buffers during `update_batch`. This approach is simple 
but can lead to significant memory overhead and unnecessary copying, especially 
for large payloads or high-cardinality groupings.
   
   This PR introduces a hybrid strategy that defers copying when it is likely 
to be beneficial. Instead of immediately materializing strings, the accumulator 
can retain references to input batches and store lightweight `(group_idx, 
row_idx)` entries. Actual string concatenation is deferred until `evaluate()`.
   
   This approach aims to:
   
   * Reduce memory duplication for large strings
   * Improve performance for workloads with many groups and large payloads
   * Maintain efficiency for small inputs via an eager fast path
   
   ---
   
   ## What changes are included in this PR?
   
   * Introduced a hybrid accumulation model:
   
     * **Eager path**: existing behavior for small inputs
     * **Deferred path**: stores references to input batches and row indices
   
   * Added new internal structures:
   
     * `batches: Vec<ArrayRef>` to retain input arrays
     * `batch_entries: Vec<Vec<(u32, u32)>>` to track `(group_idx, row_idx)` 
pairs
     * `num_groups` to track total group count
   
   * Introduced `StringInputArray` abstraction to unify handling of:
   
     * `Utf8`
     * `LargeUtf8`
     * `Utf8View`
   
   * Implemented heuristics to decide when to defer:
   
     * `DEFER_GROUP_THRESHOLD`
     * `DEFER_PAYLOAD_LEN_THRESHOLD`
   
   * Refactored append logic:
   
     * `append_rows_typed` for deferred indexing
     * `append_batch_typed` for eager materialization
     * `append_batch_values_typed` for reconstruction during evaluation
   
   * Updated `evaluate()` to:
   
     * Materialize deferred data into output
     * Support partial emits (`EmitTo::First`)
     * Retain un-emitted state correctly
   
   * Added state management improvements:
   
     * `clear_state()` to fully reset buffers
     * `retain_after_emit()` to compact deferred state after partial emits
   
   * Extended memory accounting in `size()` to include:
   
     * retained batches
     * batch entry metadata
   
   * Added test:
   
     * `groups_mixed_eager_and_deferred_batches` to validate correctness of 
hybrid behavior
   
   ---
   
   ## Are these changes tested?
   
   Yes.
   
   * Added a new test covering mixed eager and deferred batches
   * Existing tests continue to pass
   * The new test verifies:
   
     * Correct concatenation across eager + deferred inputs
     * Correct handling of partial emits
     * Proper retention of remaining groups after emission
   
   ---
   
   ## Are there any user-facing changes?
   
   No.
   
   * This change is internal to the execution engine
   * No API or SQL behavior changes
   * Expected improvements are limited to performance and memory usage 
characteristics
   
   ---
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to