Scolliq opened a new pull request, #21836:
URL: https://github.com/apache/datafusion/pull/21836
Refs #15986.
**Why:** `spark_hex` walked one nibble at a time: two `HEX_CHARS[i]`
lookups and two `Vec::push` calls per input byte. With a precomputed table,
the hot loop flattens into one indexed load and one `extend_from_slice` per byte.
**What changed:** added `HEX_LOOKUP_LOWER` / `HEX_LOOKUP_UPPER` as `[[u8;
2]; 256]` const tables built at compile time. Bytes path now does a single
lookup + 2-byte extend per input byte. The int64 path consumes two nibbles per
iteration via the same table, with a fall-through for the high nibble.
Behaviour for `0`, `i64::MAX`, `i64::MIN`, `-1` preserved.
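A sketch of the approach described above, using the table name from this PR (`HEX_LOOKUP_UPPER`); the exact code in the diff may differ:

```rust
// Build a [[u8; 2]; 256] table at compile time: entry i holds the two
// ASCII hex digits of byte i.
const fn build_hex_table(chars: &[u8; 16]) -> [[u8; 2]; 256] {
    let mut table = [[0u8; 2]; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = [chars[i >> 4], chars[i & 0x0F]];
        i += 1;
    }
    table
}

const HEX_LOOKUP_UPPER: [[u8; 2]; 256] = build_hex_table(b"0123456789ABCDEF");

// Bytes path: one indexed load and one 2-byte extend per input byte.
fn hex_encode_upper(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() * 2);
    for &b in input {
        out.extend_from_slice(&HEX_LOOKUP_UPPER[b as usize]);
    }
    out
}

// Int64 path: consume two nibbles (one byte) per iteration through the
// same table, skipping leading zeros; the "fall-through" handles the case
// where the first significant nibble is the low half of a byte.
fn hex_int64_upper(v: i64) -> Vec<u8> {
    let u = v as u64; // two's-complement view, so -1 -> "FFFFFFFFFFFFFFFF"
    if u == 0 {
        return vec![b'0'];
    }
    let mut out = Vec::with_capacity(16);
    let mut started = false;
    let mut shift = 56i32;
    while shift >= 0 {
        let byte = ((u >> shift) & 0xFF) as usize;
        let pair = HEX_LOOKUP_UPPER[byte];
        if started {
            out.extend_from_slice(&pair);
        } else if byte != 0 {
            started = true;
            if pair[0] != b'0' {
                out.extend_from_slice(&pair);
            } else {
                out.push(pair[1]); // fall-through: drop the leading zero nibble
            }
        }
        shift -= 8;
    }
    out
}
```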
**Tests:** extended `test_hex_int64` to cover edge values; new
`test_hex_lookup_table_covers_all_bytes` cross-checks every entry against
`format!("{:02x}")` / `format!("{:02X}")`; new `test_spark_hex_binary_round_trip_all_bytes` feeds
all 256 byte values through `spark_hex` and verifies the result.
`cargo test -p datafusion-spark --lib hex` → 8 pass. `cargo clippy
--all-features --all-targets` clean. `cargo bench --no-run` builds — existing
`benches/hex.rs` already covers
Int64/Utf8/Utf8View/LargeUtf8/Binary/LargeBinary plus dict paths.
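A minimal version of the exhaustive table cross-check described above (hypothetical reconstruction; the actual test lives in the PR diff):

```rust
// Same shape as the PR's table: entry i holds the two uppercase hex
// digits of byte i, built in a const initializer.
const HEX_LOOKUP_UPPER: [[u8; 2]; 256] = {
    let chars = *b"0123456789ABCDEF";
    let mut table = [[0u8; 2]; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = [chars[i >> 4], chars[i & 0x0F]];
        i += 1;
    }
    table
};

// Cross-check every entry against the standard formatter, as the new
// test_hex_lookup_table_covers_all_bytes test does.
fn check_hex_lookup_table() {
    for b in 0usize..256 {
        assert_eq!(
            std::str::from_utf8(&HEX_LOOKUP_UPPER[b]).unwrap(),
            format!("{:02X}", b),
            "table entry mismatch at byte {b}"
        );
    }
}
```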
**Not in this PR:** the #15947 review also flagged Utf8View output and
dictionary-key reuse; those felt worth their own PRs, keeping this one
focused on the per-byte hot path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]