On Fri, 9 May 2025 07:35:41 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs
> for X86 platforms [1]. However, the current implementation is not optimal for
> AArch64 SVE platform, which natively supports vector instructions for subword
> gather load operations using an int vector for indices (see [2][3]).
>
> Two key areas require improvement:
> 1. At the Java level, vector indices generated for range validation could be
>    reused for the subsequent gather load operation on architectures with native
>    vector instructions like AArch64 SVE. However, the current implementation
>    prevents compiler reuse of these index vectors due to divergent control flow,
>    potentially impacting performance.
> 2. At the compiler IR level, the additional `offset` input for
>    `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR
>    complexity and complicates backend implementation. Furthermore, generating
>    `add` instructions before each memory access negatively impacts performance.
>
> This patch refactors the implementation at both the Java level and compiler
> mid-end to improve efficiency and maintainability across different
> architectures.
>
> Main changes:
> 1. Java-side API refactoring:
>    - Explicitly passes generated index vectors to hotspot, eliminating
>      duplicate index vectors for gather load instructions on
>      architectures like AArch64.
> 2. C2 compiler IR refactoring:
>    - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword
>      types by removing the memory offset input and incorporating it into the
>      memory base `addr` at the IR level. This simplifies backend implementation,
>      reduces add operations, and unifies the IR across all types.
> 3. Backend changes:
>    - Streamlines X86 implementation of subword gather operations following
>      the removal of the offset input from the IR level.
>
> Performance:
> The performance of the relative JMH improves up to 27% on a X86 AVX512
> system.
> Please see the data below:
>
> Benchmark                                     Mode   Cnt  Unit    SIZE     Before      After  Gain
> GatherOperationsBenchmark.microByteGather128  thrpt   30  ops/ms    64  53682.012  52650.325  0.98
> GatherOperationsBenchmark.microByteGather128  thrpt   30  ops/ms   256  14484.252  14255.156  0.98
> GatherOperationsBenchmark.microByteGather128  thrpt   30  ops/ms  1024   3664.900   3595.615  0.98
> GatherOperationsBenchmark.microByteGather128  thrpt   30  ops/ms  4096    908.312    935.269  1.02
> GatherOperationsBenchmark.micr...

Hi, the counted loop recognizer patch mentioned above has been merged, so I've
rebased this PR onto the latest jdk master. The following is the new performance
data for the subword gather JMH benchmarks on X86:

Benchmark                                                  SIZE  Mode   Cnt  Unit       Before      After  Gain
GatherOperationsBenchmark.microByteGather128                 64  thrpt   30  ops/ms  44221.691  46837.124  1.05
GatherOperationsBenchmark.microByteGather128                256  thrpt   30  ops/ms  11245.455  12243.045  1.08
GatherOperationsBenchmark.microByteGather128               1024  thrpt   30  ops/ms   2825.246   3096.460  1.09
GatherOperationsBenchmark.microByteGather128               4096  thrpt   30  ops/ms    705.927    775.039  1.09
GatherOperationsBenchmark.microByteGather128_MASK            64  thrpt   30  ops/ms  46783.479  46357.684  0.99
GatherOperationsBenchmark.microByteGather128_MASK           256  thrpt   30  ops/ms  12810.405  12880.347  1.00
GatherOperationsBenchmark.microByteGather128_MASK          1024  thrpt   30  ops/ms   3150.320   3239.281  1.02
GatherOperationsBenchmark.microByteGather128_MASK          4096  thrpt   30  ops/ms    794.151    830.464  1.04
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF     64  thrpt   30  ops/ms  43189.395  47127.449  1.09
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF    256  thrpt   30  ops/ms  11543.128  13196.158  1.14
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   1024  thrpt   30  ops/ms   2835.053   3300.357  1.16
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   4096  thrpt   30  ops/ms    719.470    843.290  1.17
GatherOperationsBenchmark.microByteGather128_NZ_OFF          64  thrpt   30  ops/ms  44143.887  46836.788  1.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF         256  thrpt   30  ops/ms  12206.908  12255.677  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF        1024  thrpt   30  ops/ms   3094.232   3095.931  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF        4096  thrpt   30  ops/ms    776.293    774.336  0.99
GatherOperationsBenchmark.microByteGather256                 64  thrpt   30  ops/ms  46247.977  46803.899  1.01
GatherOperationsBenchmark.microByteGather256                256  thrpt   30  ops/ms  12198.878  12250.315  1.00
GatherOperationsBenchmark.microByteGather256               1024  thrpt   30  ops/ms   3093.356   3100.107  1.00
GatherOperationsBenchmark.microByteGather256               4096  thrpt   30  ops/ms    774.611    774.890  1.00
GatherOperationsBenchmark.microByteGather256_MASK            64  thrpt   30  ops/ms  46873.725  47967.422  1.02
GatherOperationsBenchmark.microByteGather256_MASK           256  thrpt   30  ops/ms  13025.578  13481.477  1.03
GatherOperationsBenchmark.microByteGather256_MASK          1024  thrpt   30  ops/ms   3317.651   3396.208  1.02
GatherOperationsBenchmark.microByteGather256_MASK          4096  thrpt   30  ops/ms   846.0888   864.8407  1.02
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF     64  thrpt   30  ops/ms  44488.365  48769.036  1.09
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF    256  thrpt   30  ops/ms  11988.552  13326.306  1.11
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF   1024  thrpt   30  ops/ms   2851.132   3377.599  1.18
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF   4096  thrpt   30  ops/ms    734.368    872.331  1.18
GatherOperationsBenchmark.microByteGather256_NZ_OFF          64  thrpt   30  ops/ms  44716.846  46816.743  1.04
GatherOperationsBenchmark.microByteGather256_NZ_OFF         256  thrpt   30  ops/ms  11885.251  12255.916  1.03
GatherOperationsBenchmark.microByteGather256_NZ_OFF        1024  thrpt   30  ops/ms   3016.645   3096.172  1.02
GatherOperationsBenchmark.microByteGather256_NZ_OFF        4096  thrpt   30  ops/ms    756.903    776.363  1.02
GatherOperationsBenchmark.microByteGather512                 64  thrpt   30  ops/ms  44742.221  46848.590  1.04
GatherOperationsBenchmark.microByteGather512                256  thrpt   30  ops/ms  12081.443  12236.973  1.01
GatherOperationsBenchmark.microByteGather512               1024  thrpt   30  ops/ms   3086.873   3088.040  1.00
GatherOperationsBenchmark.microByteGather512               4096  thrpt   30  ops/ms    774.243    770.209  0.99
GatherOperationsBenchmark.microByteGather512_MASK            64  thrpt   30  ops/ms  50588.210  48220.741  0.95
GatherOperationsBenchmark.microByteGather512_MASK           256  thrpt   30  ops/ms  13535.785  13675.499  1.01
GatherOperationsBenchmark.microByteGather512_MASK          1024  thrpt   30  ops/ms   3355.724   3421.323  1.01
GatherOperationsBenchmark.microByteGather512_MASK          4096  thrpt   30  ops/ms    859.103    872.009  1.01
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF     64  thrpt   30  ops/ms  44139.269  48320.364  1.09
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF    256  thrpt   30  ops/ms  12500.697  13801.124  1.10
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF   1024  thrpt   30  ops/ms   3135.082   3492.312  1.11
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF   4096  thrpt   30  ops/ms    794.338    897.249  1.12
GatherOperationsBenchmark.microByteGather512_NZ_OFF          64  thrpt   30  ops/ms  45754.147  46421.300  1.01
GatherOperationsBenchmark.microByteGather512_NZ_OFF         256  thrpt   30  ops/ms  12133.467  12253.848  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF        1024  thrpt   30  ops/ms   3074.637   3091.207  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF        4096  thrpt   30  ops/ms    755.250    774.367  1.02
GatherOperationsBenchmark.microByteGather64                  64  thrpt   30  ops/ms  58625.196  59263.141  1.01
GatherOperationsBenchmark.microByteGather64                 256  thrpt   30  ops/ms  15745.329  17377.889  1.10
GatherOperationsBenchmark.microByteGather64                1024  thrpt   30  ops/ms   4121.997   4471.261  1.08
GatherOperationsBenchmark.microByteGather64                4096  thrpt   30  ops/ms   1044.419   1125.721  1.07
GatherOperationsBenchmark.microByteGather64_MASK             64  thrpt   30  ops/ms  48754.131  49028.183  1.00
GatherOperationsBenchmark.microByteGather64_MASK            256  thrpt   30  ops/ms  13248.349  13537.811  1.02
GatherOperationsBenchmark.microByteGather64_MASK           1024  thrpt   30  ops/ms   3308.839   3356.109  1.01
GatherOperationsBenchmark.microByteGather64_MASK           4096  thrpt   30  ops/ms    843.688    859.161  1.01
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF      64  thrpt   30  ops/ms  43523.662  48868.373  1.12
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF     256  thrpt   30  ops/ms  12242.984  13519.719  1.10
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    1024  thrpt   30  ops/ms   3055.772   3394.342  1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    4096  thrpt   30  ops/ms    754.532    870.302  1.15
GatherOperationsBenchmark.microByteGather64_NZ_OFF           64  thrpt   30  ops/ms  51858.935  58869.325  1.13
GatherOperationsBenchmark.microByteGather64_NZ_OFF          256  thrpt   30  ops/ms  14235.928  17381.117  1.22
GatherOperationsBenchmark.microByteGather64_NZ_OFF         1024  thrpt   30  ops/ms   3684.506   4483.270  1.21
GatherOperationsBenchmark.microByteGather64_NZ_OFF         4096  thrpt   30  ops/ms    922.368    1127.66  1.22
GatherOperationsBenchmark.microShortGather128                64  thrpt   30  ops/ms  44399.870  45016.972  1.01
GatherOperationsBenchmark.microShortGather128               256  thrpt   30  ops/ms  11679.775  12629.207  1.08
GatherOperationsBenchmark.microShortGather128              1024  thrpt   30  ops/ms   1277.328   3206.762  2.51
GatherOperationsBenchmark.microShortGather128              4096  thrpt   30  ops/ms    761.846    817.159  1.07
GatherOperationsBenchmark.microShortGather128_MASK           64  thrpt   30  ops/ms  37165.399  36484.534  0.98
GatherOperationsBenchmark.microShortGather128_MASK          256  thrpt   30  ops/ms   9875.757   9958.754  1.00
GatherOperationsBenchmark.microShortGather128_MASK         1024  thrpt   30  ops/ms   2519.580   2554.210  1.01
GatherOperationsBenchmark.microShortGather128_MASK         4096  thrpt   30  ops/ms    615.867    652.092  1.05
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF    64  thrpt   30  ops/ms  34049.203  33669.772  0.98
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF   256  thrpt   30  ops/ms   9010.587   8779.455  0.97
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  1024  thrpt   30  ops/ms   2253.432   2415.560  1.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  4096  thrpt   30  ops/ms    559.163    577.659  1.03
GatherOperationsBenchmark.microShortGather128_NZ_OFF         64  thrpt   30  ops/ms  39892.023  43978.899  1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF        256  thrpt   30  ops/ms  10697.817  12424.189  1.16
GatherOperationsBenchmark.microShortGather128_NZ_OFF       1024  thrpt   30  ops/ms   2681.286   3145.941  1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF       4096  thrpt   30  ops/ms    682.330    803.364  1.17
GatherOperationsBenchmark.microShortGather256                64  thrpt   30  ops/ms  42335.033  43194.212  1.02
GatherOperationsBenchmark.microShortGather256               256  thrpt   30  ops/ms  10760.015  11149.020  1.03
GatherOperationsBenchmark.microShortGather256              1024  thrpt   30  ops/ms   2688.410   2806.389  1.04
GatherOperationsBenchmark.microShortGather256              4096  thrpt   30  ops/ms    675.401    703.849  1.04
GatherOperationsBenchmark.microShortGather256_MASK           64  thrpt   30  ops/ms  38760.990  41844.197  1.07
GatherOperationsBenchmark.microShortGather256_MASK          256  thrpt   30  ops/ms  11339.217  10951.141  0.96
GatherOperationsBenchmark.microShortGather256_MASK         1024  thrpt   30  ops/ms   2840.081   2718.823  0.95
GatherOperationsBenchmark.microShortGather256_MASK         4096  thrpt   30  ops/ms    725.334    696.343  0.96
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF    64  thrpt   30  ops/ms  39059.271  42199.055  1.08
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF   256  thrpt   30  ops/ms  10440.036  11467.941  1.09
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF  1024  thrpt   30  ops/ms   2563.378   2790.541  1.08
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF  4096  thrpt   30  ops/ms    642.642    751.287  1.16
GatherOperationsBenchmark.microShortGather256_NZ_OFF         64  thrpt   30  ops/ms  38963.881  42675.099  1.09
GatherOperationsBenchmark.microShortGather256_NZ_OFF        256  thrpt   30  ops/ms  10628.469  11168.949  1.05
GatherOperationsBenchmark.microShortGather256_NZ_OFF       1024  thrpt   30  ops/ms   2702.591   2806.074  1.03
GatherOperationsBenchmark.microShortGather256_NZ_OFF       4096  thrpt   30  ops/ms    683.690    704.498  1.03
GatherOperationsBenchmark.microShortGather512                64  thrpt   30  ops/ms  41117.094  41269.397  1.00
GatherOperationsBenchmark.microShortGather512               256  thrpt   30  ops/ms  10565.519  10652.618  1.00
GatherOperationsBenchmark.microShortGather512              1024  thrpt   30  ops/ms   2681.894   2705.963  1.00
GatherOperationsBenchmark.microShortGather512              4096  thrpt   30  ops/ms    673.821    679.631  1.00
GatherOperationsBenchmark.microShortGather512_MASK           64  thrpt   30  ops/ms  41318.510  42372.271  1.02
GatherOperationsBenchmark.microShortGather512_MASK          256  thrpt   30  ops/ms  11587.465  10674.598  0.92
GatherOperationsBenchmark.microShortGather512_MASK         1024  thrpt   30  ops/ms   2902.731   2629.739  0.90
GatherOperationsBenchmark.microShortGather512_MASK         4096  thrpt   30  ops/ms    741.546    671.124  0.90
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF    64  thrpt   30  ops/ms  39524.127  40623.622  1.02
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF   256  thrpt   30  ops/ms  10642.152  11392.025  1.07
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF  1024  thrpt   30  ops/ms   2650.143   2819.185  1.06
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF  4096  thrpt   30  ops/ms    672.674    739.882  1.09
GatherOperationsBenchmark.microShortGather512_NZ_OFF         64  thrpt   30  ops/ms  39861.745  41600.729  1.04
GatherOperationsBenchmark.microShortGather512_NZ_OFF        256  thrpt   30  ops/ms  10531.312  10586.255  1.00
GatherOperationsBenchmark.microShortGather512_NZ_OFF       1024  thrpt   30  ops/ms   2667.839   2678.026  1.00
GatherOperationsBenchmark.microShortGather512_NZ_OFF       4096  thrpt   30  ops/ms    667.607    677.434  1.01
GatherOperationsBenchmark.microShortGather64                 64  thrpt   30  ops/ms  45716.109  50726.590  1.10
GatherOperationsBenchmark.microShortGather64                256  thrpt   30  ops/ms  12383.842  13608.216  1.09
GatherOperationsBenchmark.microShortGather64               1024  thrpt   30  ops/ms   3025.989   3443.097  1.13
GatherOperationsBenchmark.microShortGather64               4096  thrpt   30  ops/ms    771.995    897.890  1.16
GatherOperationsBenchmark.microShortGather64_MASK            64  thrpt   30  ops/ms  39758.975  39155.984  0.98
GatherOperationsBenchmark.microShortGather64_MASK           256  thrpt   30  ops/ms  10594.260  10622.428  1.00
GatherOperationsBenchmark.microShortGather64_MASK          1024  thrpt   30  ops/ms   2654.849   2771.674  1.04
GatherOperationsBenchmark.microShortGather64_MASK          4096  thrpt   30  ops/ms    677.508    684.557  1.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF     64  thrpt   30  ops/ms  37729.191  40552.172  1.07
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF    256  thrpt   30  ops/ms  10087.184  11121.611  1.10
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   1024  thrpt   30  ops/ms   2510.133   2788.778  1.11
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   4096  thrpt   30  ops/ms    642.370    658.808  1.02
GatherOperationsBenchmark.microShortGather64_NZ_OFF          64  thrpt   30  ops/ms  40632.099  50718.706  1.24
GatherOperationsBenchmark.microShortGather64_NZ_OFF         256  thrpt   30  ops/ms  10984.671  14155.624  1.28
GatherOperationsBenchmark.microShortGather64_NZ_OFF        1024  thrpt   30  ops/ms   2733.285   3668.118  1.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF        4096  thrpt   30  ops/ms    679.524    932.748  1.37

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3004026787