Rich-T-kid opened a new issue, #21466: URL: https://github.com/apache/datafusion/issues/21466
### Two related performance improvements for dictionary-encoded columns during aggregation

**Byte-wise row comparisons**: when checking for existing groups, `group_rows.row(row) == group_values.row(*group_idx)` compares raw bytes. Since dictionary encoding represents repeated values as integer indices, comparing those integers directly is significantly cheaper.

**Hash operations**: hash calls on dictionary-encoded columns currently operate on the full string/primitive values. Replacing these with hashes of the integer dictionary keys is a modest but meaningful performance gain.

### Relevant Packages

**Hash operations**: `datafusion-common`

**Byte-wise row comparisons**: `datafusion-physical-plan`

### Goals

1. Byte-wise row comparisons are updated to compare dictionary integer indices instead of raw bytes.
2. Hash calls on dictionary-encoded columns are updated to use integer key values rather than decoded primitives.
3. Performance benchmarks demonstrate a measurable improvement.
4. Regression tests for hashing confirm that the dictionary encoding changes don't violate hash guarantees.

**Note**: Dictionary key→value mappings are not stable across `RecordBatch`es. The same key index can map to a different value in a different batch, so any new logic must not assume otherwise.
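To make the note about unstable key→value mappings concrete, here is a minimal, std-only sketch of one way to use dictionary keys without trusting them across batches. It is not DataFusion's or Arrow's API: `DictColumn`, `GroupTable`, `intern`, and `group_ids` are hypothetical names invented for illustration, and nulls are ignored. The idea is to resolve each dictionary value to a group id once per batch, then map every row through its integer key.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one dictionary-encoded column of a single batch.
struct DictColumn {
    /// Batch-local dictionary of distinct values.
    values: Vec<String>,
    /// One index into `values` per row.
    keys: Vec<usize>,
}

/// Hypothetical group table assigning a stable group id to each distinct value.
#[derive(Default)]
struct GroupTable {
    ids: HashMap<String, usize>,
}

impl GroupTable {
    /// Return the group id for `value`, creating a new group if needed.
    fn intern(&mut self, value: &str) -> usize {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = self.ids.len();
        self.ids.insert(value.to_owned(), id);
        id
    }

    /// Compute one group id per row of `batch`.
    fn group_ids(&mut self, batch: &DictColumn) -> Vec<usize> {
        // Hash and compare each dictionary value once per batch. This
        // translation table is rebuilt for every batch because key indices
        // are only meaningful inside the batch that owns the dictionary.
        let key_to_group: Vec<usize> = batch
            .values
            .iter()
            .map(|v| self.intern(v))
            .collect();

        // Per-row work is now a plain integer lookup, with no string
        // hashing or byte-wise comparison.
        batch.keys.iter().map(|&k| key_to_group[k]).collect()
    }
}
```

Calling `group_ids` on two batches whose dictionaries are `["a", "b"]` and `["b", "c"]` assigns key `0` to different groups in each batch, which is exactly the situation the note warns about. One caveat of this sketch: a dictionary value that no row references would still create an empty group, so a real implementation would need to handle unused entries.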
