Rich-T-kid opened a new issue, #21466:
URL: https://github.com/apache/datafusion/issues/21466

   ### Two related performance improvements for dictionary-encoded columns during aggregation:
   
   - **Byte-wise row comparisons**: when checking for existing groups, 
`group_rows.row(row) == group_values.row(*group_idx)` compares raw bytes. Since 
dictionary encoding represents repeated values as integer indices, comparing 
those integers directly can be significantly cheaper than comparing the full 
value bytes.
   - **Hash operations**: hash calls on dictionary-encoded columns currently 
operate on the full string/primitive values. Hashing the integer dictionary 
keys instead is a smaller but still worthwhile performance gain.
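To make the two ideas concrete, here is a minimal std-only sketch. It does not use the Arrow or DataFusion APIs: `DictColumn`, `hash_rows_via_keys`, and `rows_equal_in_batch` are hypothetical names, and the column is modelled as a plain `keys`/`values` pair. It shows one possible reading of the hashing proposal (hash each dictionary value once, then look the hash up by key), which may differ from what the issue ends up implementing. The key comparison assumes the dictionary holds no duplicate values and that both rows come from the same batch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for one batch of a dictionary-encoded string column.
struct DictColumn {
    /// One dictionary key per row.
    keys: Vec<u32>,
    /// Dictionary values, indexed by key.
    values: Vec<String>,
}

fn hash_value(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Baseline: decode and hash the full value for every row.
fn hash_rows_decoded(col: &DictColumn) -> Vec<u64> {
    col.keys
        .iter()
        .map(|&k| hash_value(&col.values[k as usize]))
        .collect()
}

/// Hash each dictionary value once, then resolve every row by its key.
/// The resulting hashes are identical to the baseline.
fn hash_rows_via_keys(col: &DictColumn) -> Vec<u64> {
    let value_hashes: Vec<u64> = col.values.iter().map(|v| hash_value(v)).collect();
    col.keys.iter().map(|&k| value_hashes[k as usize]).collect()
}

/// Within a single batch, two rows hold the same value when their keys match.
/// Assumption: the dictionary contains no duplicate values.
fn rows_equal_in_batch(col: &DictColumn, a: usize, b: usize) -> bool {
    col.keys[a] == col.keys[b]
}
```

With this shape, the number of value hashes per batch drops from the row count to the dictionary size, while the hash seen by downstream code stays value-based.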
   
   ### Relevant Packages
   
   - **Hash operations**: `datafusion-common`
   - **Byte-wise row comparisons**: `datafusion-physical-plan`
   
   ### Goals
   1. Byte-wise row comparisons updated to compare dictionary integer indices 
instead of raw bytes
   2. Hash calls on dictionary-encoded columns updated to use integer key 
values rather than decoded primitives
   3. Performance benchmarks demonstrate measurable improvement
   4. Regression tests for hashing confirm dictionary encoding changes don't 
violate hash guarantees
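A regression test for goal 4 could check the property that rows with equal values receive equal hashes even when they come from batches with different dictionaries. The sketch below is std-only and illustrative: `Batch`, `hashes_by_value`, `hashes_by_raw_key`, and `equal_values_hash_equal` are hypothetical names, not DataFusion functions. It also shows why hashing the raw key alone would fail such a test.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical view of one dictionary-encoded batch.
struct Batch<'a> {
    keys: &'a [u32],
    values: &'a [&'a str],
}

fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hash each row by its decoded dictionary value.
fn hashes_by_value(batch: &Batch) -> Vec<u64> {
    batch
        .keys
        .iter()
        .map(|&k| hash_of(batch.values[k as usize]))
        .collect()
}

/// Naive alternative: hash the raw key and ignore the dictionary.
fn hashes_by_raw_key(batch: &Batch) -> Vec<u64> {
    batch.keys.iter().map(|k| hash_of(k)).collect()
}

/// Returns true when every pair of rows with equal decoded values,
/// one row from each batch, also has equal hashes.
fn equal_values_hash_equal(a: &Batch, a_hashes: &[u64], b: &Batch, b_hashes: &[u64]) -> bool {
    a.keys.iter().zip(a_hashes).all(|(&ka, ha)| {
        b.keys.iter().zip(b_hashes).all(|(&kb, hb)| {
            a.values[ka as usize] != b.values[kb as usize] || ha == hb
        })
    })
}
```

Feeding it two batches whose dictionaries list the same values in a different order, the value-based hashes should pass the check and the raw-key hashes should not.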
   
   
   **Note**: Dictionary key→value mappings are not stable across RecordBatches 
— the same key index can map to a different value in a different batch. Any new 
logic must not assume otherwise.
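One way to respect this constraint is to keep any key-based shortcut local to a single batch and fall back to the value when a key is first seen in that batch. The following std-only sketch illustrates the idea; `GroupIds` and `intern_batch` are hypothetical names and do not correspond to existing DataFusion types.

```rust
use std::collections::HashMap;

/// Hypothetical accumulator that assigns stable group ids across batches.
struct GroupIds {
    /// Group id per distinct value; this is the only state kept between batches.
    by_value: HashMap<String, usize>,
}

impl GroupIds {
    fn new() -> Self {
        GroupIds { by_value: HashMap::new() }
    }

    /// Returns the group id for each row of one batch.
    /// The key -> group cache is rebuilt for every batch, because the same
    /// key may refer to a different value in the next batch.
    fn intern_batch(&mut self, keys: &[u32], values: &[String]) -> Vec<usize> {
        let mut key_to_group: Vec<Option<usize>> = vec![None; values.len()];
        let mut out = Vec::with_capacity(keys.len());
        for &k in keys {
            let k = k as usize;
            let group = match key_to_group[k] {
                // Fast path: this key was already resolved in this batch.
                Some(g) => g,
                // Slow path: resolve through the value once per key per batch.
                None => {
                    let next = self.by_value.len();
                    let g = *self.by_value.entry(values[k].clone()).or_insert(next);
                    key_to_group[k] = Some(g);
                    g
                }
            };
            out.push(group);
        }
        out
    }
}
```

For example, if a first batch uses the dictionary `["x", "y"]` and a second uses `["y", "z"]`, key `0` resolves to the group of `"x"` in the first batch and to the group of `"y"` in the second.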


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

