This patch adds intrinsic support for UMIN and UMAX reduction operations in the 
Vector API on AArch64, enabling direct hardware instruction mapping for better 
performance.

Changes:
--------

1. C2 mid-end:
   - Added UMinReductionVNode and UMaxReductionVNode

2. AArch64 Backend:
   - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
   - Updated match rules for all vector sizes and element types
   - Both NEON and SVE implementations are supported

3. Test:
   - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
   - Added assembly tests in aarch64-asmtest.py for new instructions
   - Added a JTReg test file VectorUMinMaxReductionTest.java

Different configurations were tested on aarch64 and x86 machines, and all tests 
passed.
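
For reference, the semantics the new reduction nodes have to match can be
sketched in scalar Java. This is an illustration only, not code from the
patch: the class and method names are hypothetical. It assumes the usual
identities, 0 for unsigned max and all bits set (-1) for unsigned min, which
is what a reduction over no active lanes would be expected to return.

```java
final class UnsignedReductionSketch {
    // Hypothetical names. 0 is the smallest value in unsigned order.
    static final int UMAX_IDENTITY = 0;
    // All bits set (0xFFFFFFFF) is the largest value in unsigned order.
    static final int UMIN_IDENTITY = -1;

    // Scalar reference for an unsigned max reduction over int lanes.
    static int umaxReduce(int[] lanes) {
        int acc = UMAX_IDENTITY;
        for (int v : lanes) {
            if (Integer.compareUnsigned(v, acc) > 0) {
                acc = v;
            }
        }
        return acc;
    }

    // Scalar reference for an unsigned min reduction over int lanes.
    static int uminReduce(int[] lanes) {
        int acc = UMIN_IDENTITY;
        for (int v : lanes) {
            if (Integer.compareUnsigned(v, acc) < 0) {
                acc = v;
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] lanes = {3, -1, 5, 7};
        // -1 is 0xFFFFFFFF, so it wins the unsigned max even though a
        // signed max over the same lanes would pick 7.
        System.out.println(Integer.toUnsignedString(umaxReduce(lanes)));
        System.out.println(Integer.toUnsignedString(uminReduce(lanes)));
    }
}
```

A signed reduction over the same input would return 7 for max and -1 for min,
which is why the unsigned operations need their own nodes and identities.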

Test results of JMH benchmarks from the panama-vector project:
--------

On an NVIDIA Grace machine with 128-bit SVE:

Benchmark                       Unit    Before  Error   After     Error   Uplift
Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51  33.92   61.29
Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90  28.74   45.09
Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29  103.11  43.99
Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62  42.68   42.06
Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74  15.95   48.45
Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24  21.41   37.90
Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36  66.20   41.31
Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37  13.79   40.19
Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07   286.93  56.67
Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77  11.44   65.17
Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30   98.57   49.52
Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18   19.76   53.22
Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01   35.52   62.82
Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24   8.74    64.34
Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76     52.28   0.98
Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79     42.91   1.01
Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82  23.65   46.79
Short128Vector.UMAXMaskedLanes  ops/ms  308.90  351.78  15155.26  31.03   49.06
Short64Vector.UMAXLanes         ops/ms  190.38  245.09  8022.46   14.30   42.14
Short64Vector.UMAXMaskedLanes   ops/ms  195.54  36.15   7930.28   11.88   40.56


On an NVIDIA Grace machine with 128-bit NEON:

Benchmark                       Unit    Before  Error   After     Error   Uplift
Byte128Vector.UMAXLanes         ops/ms  414.69  42.52   25257.61  25.91   60.91
Byte128Vector.UMAXMaskedLanes   ops/ms  552.00  56.61   23063.14  304.45  41.78
Byte128Vector.UMINLanes         ops/ms  634.98  849.04  28444.37  180.80  44.80
Byte128Vector.UMINMaskedLanes   ops/ms  612.88  735.18  26127.07  27.99   42.63
Byte64Vector.UMAXLanes          ops/ms  291.53  32.19   13893.62  28.09   47.66
Byte64Vector.UMAXMaskedLanes    ops/ms  363.34  48.17   13290.59  12.53   36.58
Byte64Vector.UMINLanes          ops/ms  368.70  433.60  15416.90  15.80   41.81
Byte64Vector.UMINMaskedLanes    ops/ms  350.46  371.05  14524.29  121.63  41.44
Int128Vector.UMAXLanes          ops/ms  177.67  201.38  10182.82  20.21   57.31
Int128Vector.UMAXMaskedLanes    ops/ms  155.25  187.88  9194.13   393.35  59.22
Int64Vector.UMAXLanes           ops/ms  93.93   115.02  5106.79   4.54    54.37
Int64Vector.UMAXMaskedLanes     ops/ms  87.01   88.50   4405.87   8.06    50.63
Long128Vector.UMAXLanes         ops/ms  80.32   98.50   3229.80   40.53   40.21
Long128Vector.UMAXMaskedLanes   ops/ms  77.65   103.25  3161.50   4.45    40.72
Long64Vector.UMAXLanes          ops/ms  47.72   65.38   46.41     50.38   0.97
Long64Vector.UMAXMaskedLanes    ops/ms  45.26   47.46   45.13     47.23   1.00
Short128Vector.UMAXLanes        ops/ms  316.09  429.34  14748.07  14.78   46.66
Short128Vector.UMAXMaskedLanes  ops/ms  307.70  342.54  14359.11  44.99   46.67
Short64Vector.UMAXLanes         ops/ms  187.67  253.01  8180.63   178.65  43.59
Short64Vector.UMAXMaskedLanes   ops/ms  191.10  33.51   7949.19   108.65  41.60

-------------

Commit messages:
 - 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max 
reduction operations
 - 8372978: [VectorAPI] Fix incorrect identity values in UMIN/UMAX reductions

Changes: https://git.openjdk.org/jdk/pull/28693/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28693&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8372980
  Stats: 1607 lines in 49 files changed: 835 ins; 16 del; 756 mod
  Patch: https://git.openjdk.org/jdk/pull/28693.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28693/head:pull/28693

PR: https://git.openjdk.org/jdk/pull/28693
