On Tue, 7 Jan 2025 10:39:18 GMT, Shaojin Wen <s...@openjdk.org> wrote:

> In PR #22928, UUID introduced long-based vectorized hexadecimal-to-string
> conversion, which can also be used in Integer::toHexString and
> Long::toHexString to eliminate table lookups. The benefit of eliminating
> table lookups is that performance is better when cache misses occur.

Test data from both aarch64 and x64 machines shows a performance improvement of roughly 10% to 30%. On a MacBook M1 Pro, however, the improvement for Integer.toHexString exceeds 100%.

## 1. Script

    git remote add wenshao g...@github.com:wenshao/jdk.git
    git fetch wenshao

    # baseline 91db7c0877a
    git checkout 91db7c0877a68ad171da2b4501280fc24630ae83
    make test TEST="micro:java.lang.Integers.toHexString"
    make test TEST="micro:java.lang.Longs.toHexString"

    # current 1788d09787c
    git checkout 1788d09787cadfe6ec23b9b10bef87a2cdc029a3
    make test TEST="micro:java.lang.Integers.toHexString"
    make test TEST="micro:java.lang.Longs.toHexString"

## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

    -Benchmark             (size)  Mode  Cnt  Score   Error  Units (baseline 91db7c0877a)
    -Integers.toHexString     500  avgt   15  4.855 ± 0.058  us/op
    -Longs.toHexString        500  avgt   15  6.098 ± 0.034  us/op

    +Benchmark             (size)  Mode  Cnt  Score   Error  Units (current 1788d09787c)
    +Integers.toHexString     500  avgt   15  4.105 ± 0.010  us/op +18.27%
    +Longs.toHexString        500  avgt   15  4.682 ± 0.116  us/op +30.24%

## 3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)

    -Benchmark             (size)  Mode  Cnt  Score   Error  Units
    -Integers.toHexString     500  avgt   15  5.158 ± 0.025  us/op
    -Longs.toHexString        500  avgt   15  6.072 ± 0.020  us/op

    +Benchmark             (size)  Mode  Cnt  Score   Error  Units
    +Integers.toHexString     500  avgt   15  4.691 ± 0.024  us/op +9.95%
    +Longs.toHexString        500  avgt   15  4.947 ± 0.024  us/op +22.74%

## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)

    -Benchmark             (size)  Mode  Cnt  Score   Error  Units
    -Integers.toHexString     500  avgt   15  5.880 ± 0.017  us/op
    -Longs.toHexString        500  avgt   15  7.183 ± 0.013  us/op

    +Benchmark             (size)  Mode  Cnt  Score   Error  Units
    +Integers.toHexString     500  avgt   15  5.282 ± 0.012  us/op +11.32%
    +Longs.toHexString        500  avgt   15  5.530 ± 0.013  us/op +29.89%

## 5. MacBook M1 Pro (aarch64)

    -Benchmark             (size)  Mode  Cnt   Score   Error  Units (baseline 91db7c0877a)
    -Integers.toHexString     500  avgt   15  10.519 ± 1.573  us/op
    -Longs.toHexString        500  avgt   15   5.754 ± 0.264  us/op

    +Benchmark             (size)  Mode  Cnt   Score   Error  Units (current 1788d09787c)
    +Integers.toHexString     500  avgt   15   5.057 ± 0.015  us/op +108.00%
    +Longs.toHexString        500  avgt   15   5.147 ± 0.095  us/op +11.79%

Because this algorithm underperforms the original version when handling smaller numbers, I have marked this PR as draft.

Additionally, this algorithm is used in another PR, #22928 [Speed up UUID::toString](https://github.com/openjdk/jdk/pull/22928), and it still suffers a performance degradation with Long.expand on older CPU architectures.

    // Method 1:
    i = Long.reverseBytes(Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL));

    // Method 2:
    i = ((i & 0xF0000000L) >> 28)
      | ((i & 0xF000000L) >> 16)
      | ((i & 0xF00000L) >> 4)
      | ((i & 0xF0000L) << 8)
      | ((i & 0xF000L) << 20)
      | ((i & 0xF00L) << 32)
      | ((i & 0xF0L) << 44)
      | ((i & 0xFL) << 56);

Note: Using Long.reverseBytes + Long.expand is faster on x64 and ARMv9. On AArch64 with ARMv8, however, it is slower than the manual unrolling shown in Method 2. ARMv8 includes Apple M1/M2 and AWS Graviton 3; ARMv9.0 includes Apple M3/M4 and Aliyun Yitian 710. I haven't tested this on older x64 CPUs such as AMD Zen 1, but they may well have the same issue.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22942#issuecomment-2576197320
PR Comment: https://git.openjdk.org/jdk/pull/22942#issuecomment-2578863538
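
As a reading aid for Method 1 and Method 2 quoted in the comment above, here is a minimal sketch that checks the two expressions against each other on a few sample values. The class name `NibbleSpread` and the method names `viaExpand` and `viaShifts` are hypothetical and not part of the PR; only the two expressions come from the comment. It assumes JDK 19 or later, where `Long.expand` is available. Both forms place each of the eight hex digits of the low 32 bits into its own byte, with the most significant digit in the lowest byte.

```java
public class NibbleSpread {
    // Method 1 from the comment: Long.expand scatters the low 32 bits of i
    // into the low nibble of each byte; reverseBytes then flips the byte order.
    static long viaExpand(long i) {
        return Long.reverseBytes(Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL));
    }

    // Method 2 from the comment: the same placement written as explicit
    // masks and shifts, one term per hex digit.
    static long viaShifts(long i) {
        return ((i & 0xF0000000L) >> 28)
             | ((i & 0xF000000L) >> 16)
             | ((i & 0xF00000L) >> 4)
             | ((i & 0xF0000L) << 8)
             | ((i & 0xF000L) << 20)
             | ((i & 0xF00L) << 32)
             | ((i & 0xF0L) << 44)
             | ((i & 0xFL) << 56);
    }

    public static void main(String[] args) {
        // Illustrative sample inputs only; not taken from the benchmark.
        long[] samples = {0L, 1L, 0xFL, 0x12345678L, 0xCAFEBABEL, 0xFFFFFFFFL};
        for (long v : samples) {
            long a = viaExpand(v);
            long b = viaShifts(v);
            if (a != b) {
                throw new AssertionError("mismatch for input " + Long.toHexString(v));
            }
            System.out.printf("%08x -> %016x%n", v, a);
        }
    }
}
```

This only shows that the two forms agree on these inputs; it says nothing about their relative speed, which is what the benchmarks above measure.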