On Fri, 9 May 2025 07:35:41 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs 
> for X86 platforms [1]. However, the current implementation is not optimal for 
> AArch64 SVE platform, which natively supports vector instructions for subword 
> gather load operations using an int vector for indices (see [2][3]).
> 
> Two key areas require improvement:
> 1. At the Java level, vector indices generated for range validation could be 
> reused for the subsequent gather load operation on architectures with native 
> vector instructions like AArch64 SVE. However, the current implementation 
> prevents compiler reuse of these index vectors due to divergent control flow, 
> potentially impacting performance.
> 2. At the compiler IR level, the additional `offset` input for 
> `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR 
> complexity and complicates backend implementation. Furthermore, generating 
> `add` instructions before each memory access negatively impacts performance.
> 
> This patch refactors the implementation at both the Java level and compiler 
> mid-end to improve efficiency and maintainability across different 
> architectures.
> 
> Main changes:
> 1. Java-side API refactoring:
>    - Explicitly passes generated index vectors to hotspot, eliminating 
> duplicate index vectors for gather load instructions on architectures like 
> AArch64.
> 2. C2 compiler IR refactoring:
>    - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword 
> types by removing the memory offset input and incorporating it into the 
> memory base `addr` at the IR level. This simplifies backend implementation, 
> reduces add operations, and unifies the IR across all types.
> 3. Backend changes:
>    - Streamlines X86 implementation of subword gather operations following 
> the removal of the offset input from the IR level.
> 
> Performance:
> The performance of the relevant JMH benchmarks improves by up to 27% on an 
> X86 AVX512 system. Please see the data below:
> 
> Benchmark                                                 Mode   Cnt Unit    SIZE    Before      After    Gain
> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  64    53682.012   52650.325  0.98
> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  256   14484.252   14255.156  0.98
> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  1024   3664.900    3595.615  0.98
> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  4096    908.312     935.269  1.02
> GatherOperationsBenchmark.micr...
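
As a rough scalar picture of item 2 in the quoted "Main changes" list (folding the subword `offset` into the memory base instead of carrying it as a separate input), here is a minimal Java sketch. It is not the C2 implementation; the class and method names are hypothetical and the `ByteBuffer` slice only stands in for "base address plus offset, computed once".

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in the JDK sources.
class SubwordGatherSketch {
    // Shape before the refactoring, as described in the quoted text: the
    // offset is a separate input and is added at every element access.
    static byte[] gatherWithOffsetInput(byte[] src, int offset, int[] indices) {
        byte[] out = new byte[indices.length];
        for (int lane = 0; lane < indices.length; lane++) {
            out[lane] = src[offset + indices[lane]]; // one add per access
        }
        return out;
    }

    // Shape after the refactoring: the offset is folded into the base once,
    // so each lane is just base[index].
    static byte[] gatherWithFoldedBase(byte[] src, int offset, int[] indices) {
        // slice() yields a view whose position 0 is src[offset].
        ByteBuffer base = ByteBuffer.wrap(src, offset, src.length - offset).slice();
        byte[] out = new byte[indices.length];
        for (int lane = 0; lane < indices.length; lane++) {
            out[lane] = base.get(indices[lane]); // absolute get relative to the view
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] src = {10, 11, 12, 13, 14, 15, 16, 17};
        int[] indices = {3, 0, 2, 1};
        byte[] a = gatherWithOffsetInput(src, 2, indices);
        byte[] b = gatherWithFoldedBase(src, 2, indices);
        System.out.println(Arrays.toString(a)); // [15, 12, 14, 13]
        System.out.println(Arrays.equals(a, b)); // true
    }
}
```

Both methods return the same lanes; the second simply performs the base adjustment once rather than per access.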

Hi, the counted loop recognizer patch mentioned above has been merged, so I've 
rebased this PR onto the latest jdk master. The new performance data for the 
subword gather JMH benchmarks on X86 follows:

Benchmark                                                 SIZE Mode   Cnt Unit      Before      After  Gain
GatherOperationsBenchmark.microByteGather128                64 thrpt  30  ops/ms 44221.691  46837.124  1.05
GatherOperationsBenchmark.microByteGather128               256 thrpt  30  ops/ms 11245.455  12243.045  1.08
GatherOperationsBenchmark.microByteGather128              1024 thrpt  30  ops/ms  2825.246   3096.460  1.09
GatherOperationsBenchmark.microByteGather128              4096 thrpt  30  ops/ms   705.927    775.039  1.09
GatherOperationsBenchmark.microByteGather128_MASK           64 thrpt  30  ops/ms 46783.479  46357.684  0.99
GatherOperationsBenchmark.microByteGather128_MASK          256 thrpt  30  ops/ms 12810.405  12880.347  1.00
GatherOperationsBenchmark.microByteGather128_MASK         1024 thrpt  30  ops/ms  3150.320   3239.281  1.02
GatherOperationsBenchmark.microByteGather128_MASK         4096 thrpt  30  ops/ms   794.151    830.464  1.04
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF    64 thrpt  30  ops/ms 43189.395  47127.449  1.09
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   256 thrpt  30  ops/ms 11543.128  13196.158  1.14
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  1024 thrpt  30  ops/ms  2835.053   3300.357  1.16
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  4096 thrpt  30  ops/ms   719.470    843.290  1.17
GatherOperationsBenchmark.microByteGather128_NZ_OFF         64 thrpt  30  ops/ms 44143.887  46836.788  1.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF        256 thrpt  30  ops/ms 12206.908  12255.677  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF       1024 thrpt  30  ops/ms  3094.232   3095.931  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF       4096 thrpt  30  ops/ms   776.293    774.336  0.99
GatherOperationsBenchmark.microByteGather256                64 thrpt  30  ops/ms 46247.977  46803.899  1.01
GatherOperationsBenchmark.microByteGather256               256 thrpt  30  ops/ms 12198.878  12250.315  1.00
GatherOperationsBenchmark.microByteGather256              1024 thrpt  30  ops/ms  3093.356   3100.107  1.00
GatherOperationsBenchmark.microByteGather256              4096 thrpt  30  ops/ms   774.611    774.890  1.00
GatherOperationsBenchmark.microByteGather256_MASK           64 thrpt  30  ops/ms 46873.725  47967.422  1.02
GatherOperationsBenchmark.microByteGather256_MASK          256 thrpt  30  ops/ms 13025.578  13481.477  1.03
GatherOperationsBenchmark.microByteGather256_MASK         1024 thrpt  30  ops/ms  3317.651   3396.208  1.02
GatherOperationsBenchmark.microByteGather256_MASK         4096 thrpt  30  ops/ms  846.0888   864.8407  1.02
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF    64 thrpt  30  ops/ms 44488.365  48769.036  1.09
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF   256 thrpt  30  ops/ms 11988.552  13326.306  1.11
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF  1024 thrpt  30  ops/ms  2851.132   3377.599  1.18
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF  4096 thrpt  30  ops/ms   734.368    872.331  1.18
GatherOperationsBenchmark.microByteGather256_NZ_OFF         64 thrpt  30  ops/ms 44716.846  46816.743  1.04
GatherOperationsBenchmark.microByteGather256_NZ_OFF        256 thrpt  30  ops/ms 11885.251  12255.916  1.03
GatherOperationsBenchmark.microByteGather256_NZ_OFF       1024 thrpt  30  ops/ms  3016.645   3096.172  1.02
GatherOperationsBenchmark.microByteGather256_NZ_OFF       4096 thrpt  30  ops/ms   756.903    776.363  1.02
GatherOperationsBenchmark.microByteGather512                64 thrpt  30  ops/ms 44742.221  46848.590  1.04
GatherOperationsBenchmark.microByteGather512               256 thrpt  30  ops/ms 12081.443  12236.973  1.01
GatherOperationsBenchmark.microByteGather512              1024 thrpt  30  ops/ms  3086.873   3088.040  1.00
GatherOperationsBenchmark.microByteGather512              4096 thrpt  30  ops/ms   774.243    770.209  0.99
GatherOperationsBenchmark.microByteGather512_MASK           64 thrpt  30  ops/ms 50588.210  48220.741  0.95
GatherOperationsBenchmark.microByteGather512_MASK          256 thrpt  30  ops/ms 13535.785  13675.499  1.01
GatherOperationsBenchmark.microByteGather512_MASK         1024 thrpt  30  ops/ms  3355.724   3421.323  1.01
GatherOperationsBenchmark.microByteGather512_MASK         4096 thrpt  30  ops/ms   859.103    872.009  1.01
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF    64 thrpt  30  ops/ms 44139.269  48320.364  1.09
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF   256 thrpt  30  ops/ms 12500.697  13801.124  1.10
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF  1024 thrpt  30  ops/ms  3135.082   3492.312  1.11
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF  4096 thrpt  30  ops/ms   794.338    897.249  1.12
GatherOperationsBenchmark.microByteGather512_NZ_OFF         64 thrpt  30  ops/ms 45754.147  46421.300  1.01
GatherOperationsBenchmark.microByteGather512_NZ_OFF        256 thrpt  30  ops/ms 12133.467  12253.848  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF       1024 thrpt  30  ops/ms  3074.637   3091.207  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF       4096 thrpt  30  ops/ms   755.250    774.367  1.02
GatherOperationsBenchmark.microByteGather64                 64 thrpt  30  ops/ms 58625.196  59263.141  1.01
GatherOperationsBenchmark.microByteGather64                256 thrpt  30  ops/ms 15745.329  17377.889  1.10
GatherOperationsBenchmark.microByteGather64               1024 thrpt  30  ops/ms  4121.997   4471.261  1.08
GatherOperationsBenchmark.microByteGather64               4096 thrpt  30  ops/ms  1044.419   1125.721  1.07
GatherOperationsBenchmark.microByteGather64_MASK            64 thrpt  30  ops/ms 48754.131  49028.183  1.00
GatherOperationsBenchmark.microByteGather64_MASK           256 thrpt  30  ops/ms 13248.349  13537.811  1.02
GatherOperationsBenchmark.microByteGather64_MASK          1024 thrpt  30  ops/ms  3308.839   3356.109  1.01
GatherOperationsBenchmark.microByteGather64_MASK          4096 thrpt  30  ops/ms   843.688    859.161  1.01
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF     64 thrpt  30  ops/ms 43523.662  48868.373  1.12
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    256 thrpt  30  ops/ms 12242.984  13519.719  1.10
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   1024 thrpt  30  ops/ms  3055.772   3394.342  1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   4096 thrpt  30  ops/ms   754.532    870.302  1.15
GatherOperationsBenchmark.microByteGather64_NZ_OFF          64 thrpt  30  ops/ms 51858.935  58869.325  1.13
GatherOperationsBenchmark.microByteGather64_NZ_OFF         256 thrpt  30  ops/ms 14235.928  17381.117  1.22
GatherOperationsBenchmark.microByteGather64_NZ_OFF        1024 thrpt  30  ops/ms  3684.506   4483.270  1.21
GatherOperationsBenchmark.microByteGather64_NZ_OFF        4096 thrpt  30  ops/ms   922.368    1127.66  1.22
GatherOperationsBenchmark.microShortGather128               64 thrpt  30  ops/ms 44399.870  45016.972  1.01
GatherOperationsBenchmark.microShortGather128              256 thrpt  30  ops/ms 11679.775  12629.207  1.08
GatherOperationsBenchmark.microShortGather128             1024 thrpt  30  ops/ms  1277.328   3206.762  2.51
GatherOperationsBenchmark.microShortGather128             4096 thrpt  30  ops/ms   761.846    817.159  1.07
GatherOperationsBenchmark.microShortGather128_MASK          64 thrpt  30  ops/ms 37165.399  36484.534  0.98
GatherOperationsBenchmark.microShortGather128_MASK         256 thrpt  30  ops/ms  9875.757   9958.754  1.00
GatherOperationsBenchmark.microShortGather128_MASK        1024 thrpt  30  ops/ms  2519.580   2554.210  1.01
GatherOperationsBenchmark.microShortGather128_MASK        4096 thrpt  30  ops/ms   615.867    652.092  1.05
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF   64 thrpt  30  ops/ms 34049.203  33669.772  0.98
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  256 thrpt  30  ops/ms  9010.587   8779.455  0.97
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt  30  ops/ms  2253.432   2415.560  1.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt  30  ops/ms   559.163    577.659  1.03
GatherOperationsBenchmark.microShortGather128_NZ_OFF        64 thrpt  30  ops/ms 39892.023  43978.899  1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF       256 thrpt  30  ops/ms 10697.817  12424.189  1.16
GatherOperationsBenchmark.microShortGather128_NZ_OFF      1024 thrpt  30  ops/ms  2681.286   3145.941  1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF      4096 thrpt  30  ops/ms   682.330    803.364  1.17
GatherOperationsBenchmark.microShortGather256               64 thrpt  30  ops/ms 42335.033  43194.212  1.02
GatherOperationsBenchmark.microShortGather256              256 thrpt  30  ops/ms 10760.015  11149.020  1.03
GatherOperationsBenchmark.microShortGather256             1024 thrpt  30  ops/ms  2688.410   2806.389  1.04
GatherOperationsBenchmark.microShortGather256             4096 thrpt  30  ops/ms   675.401    703.849  1.04
GatherOperationsBenchmark.microShortGather256_MASK          64 thrpt  30  ops/ms 38760.990  41844.197  1.07
GatherOperationsBenchmark.microShortGather256_MASK         256 thrpt  30  ops/ms 11339.217  10951.141  0.96
GatherOperationsBenchmark.microShortGather256_MASK        1024 thrpt  30  ops/ms  2840.081   2718.823  0.95
GatherOperationsBenchmark.microShortGather256_MASK        4096 thrpt  30  ops/ms   725.334    696.343  0.96
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF   64 thrpt  30  ops/ms 39059.271  42199.055  1.08
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF  256 thrpt  30  ops/ms 10440.036  11467.941  1.09
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF 1024 thrpt  30  ops/ms  2563.378   2790.541  1.08
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF 4096 thrpt  30  ops/ms   642.642    751.287  1.16
GatherOperationsBenchmark.microShortGather256_NZ_OFF        64 thrpt  30  ops/ms 38963.881  42675.099  1.09
GatherOperationsBenchmark.microShortGather256_NZ_OFF       256 thrpt  30  ops/ms 10628.469  11168.949  1.05
GatherOperationsBenchmark.microShortGather256_NZ_OFF      1024 thrpt  30  ops/ms  2702.591   2806.074  1.03
GatherOperationsBenchmark.microShortGather256_NZ_OFF      4096 thrpt  30  ops/ms   683.690    704.498  1.03
GatherOperationsBenchmark.microShortGather512               64 thrpt  30  ops/ms 41117.094  41269.397  1.00
GatherOperationsBenchmark.microShortGather512              256 thrpt  30  ops/ms 10565.519  10652.618  1.00
GatherOperationsBenchmark.microShortGather512             1024 thrpt  30  ops/ms  2681.894   2705.963  1.00
GatherOperationsBenchmark.microShortGather512             4096 thrpt  30  ops/ms   673.821    679.631  1.00
GatherOperationsBenchmark.microShortGather512_MASK          64 thrpt  30  ops/ms 41318.510  42372.271  1.02
GatherOperationsBenchmark.microShortGather512_MASK         256 thrpt  30  ops/ms 11587.465  10674.598  0.92
GatherOperationsBenchmark.microShortGather512_MASK        1024 thrpt  30  ops/ms  2902.731   2629.739  0.90
GatherOperationsBenchmark.microShortGather512_MASK        4096 thrpt  30  ops/ms   741.546    671.124  0.90
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF   64 thrpt  30  ops/ms 39524.127  40623.622  1.02
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF  256 thrpt  30  ops/ms 10642.152  11392.025  1.07
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF 1024 thrpt  30  ops/ms  2650.143   2819.185  1.06
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF 4096 thrpt  30  ops/ms   672.674    739.882  1.09
GatherOperationsBenchmark.microShortGather512_NZ_OFF        64 thrpt  30  ops/ms 39861.745  41600.729  1.04
GatherOperationsBenchmark.microShortGather512_NZ_OFF       256 thrpt  30  ops/ms 10531.312  10586.255  1.00
GatherOperationsBenchmark.microShortGather512_NZ_OFF      1024 thrpt  30  ops/ms  2667.839   2678.026  1.00
GatherOperationsBenchmark.microShortGather512_NZ_OFF      4096 thrpt  30  ops/ms   667.607    677.434  1.01
GatherOperationsBenchmark.microShortGather64                64 thrpt  30  ops/ms 45716.109  50726.590  1.10
GatherOperationsBenchmark.microShortGather64               256 thrpt  30  ops/ms 12383.842  13608.216  1.09
GatherOperationsBenchmark.microShortGather64              1024 thrpt  30  ops/ms  3025.989   3443.097  1.13
GatherOperationsBenchmark.microShortGather64              4096 thrpt  30  ops/ms   771.995    897.890  1.16
GatherOperationsBenchmark.microShortGather64_MASK           64 thrpt  30  ops/ms 39758.975  39155.984  0.98
GatherOperationsBenchmark.microShortGather64_MASK          256 thrpt  30  ops/ms 10594.260  10622.428  1.00
GatherOperationsBenchmark.microShortGather64_MASK         1024 thrpt  30  ops/ms  2654.849   2771.674  1.04
GatherOperationsBenchmark.microShortGather64_MASK         4096 thrpt  30  ops/ms   677.508    684.557  1.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF    64 thrpt  30  ops/ms 37729.191  40552.172  1.07
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   256 thrpt  30  ops/ms 10087.184  11121.611  1.10
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  1024 thrpt  30  ops/ms  2510.133   2788.778  1.11
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  4096 thrpt  30  ops/ms   642.370    658.808  1.02
GatherOperationsBenchmark.microShortGather64_NZ_OFF         64 thrpt  30  ops/ms 40632.099  50718.706  1.24
GatherOperationsBenchmark.microShortGather64_NZ_OFF        256 thrpt  30  ops/ms 10984.671  14155.624  1.28
GatherOperationsBenchmark.microShortGather64_NZ_OFF       1024 thrpt  30  ops/ms  2733.285   3668.118  1.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF       4096 thrpt  30  ops/ms   679.524    932.748  1.37
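
For readers who want to re-derive the Gain column: on the rows I spot-checked it matches After / Before truncated (not rounded) to two decimals. That truncation rule is an inference from the numbers, not something stated in the PR, and the class and method names below are hypothetical. A minimal sketch:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper; the truncation rule is inferred from the table, not documented.
class GainSketch {
    // Gain = After / Before, cut to two decimals (RoundingMode.DOWN truncates).
    static BigDecimal gain(String before, String after) {
        return new BigDecimal(after).divide(new BigDecimal(before), 2, RoundingMode.DOWN);
    }

    public static void main(String[] args) {
        // microByteGather128, SIZE 64
        System.out.println(gain("44221.691", "46837.124")); // 1.05
        // microShortGather64_NZ_OFF, SIZE 4096
        System.out.println(gain("679.524", "932.748"));     // 1.37
    }
}
```

Since the metric is throughput (`thrpt`, ops/ms), values above 1.00 are improvements and values below 1.00 are regressions.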

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3004026787
