History: This is version 2 of the patch. In the original patch, all 44 fusion
opportunities were lumped together in one patch. Outside of fusion.md, the
changes are fairly small: they add one alternative to each of the fusion
patterns to add xxeval support. Fusion.md is a generated file (created by
genfusion.pl) that contains all of the fusion combinations. Because of these
automated changes, fusion.md had 265 lines deleted and 397 lines added.
In version 2 of the patch, I broke the original patch into 45 separate patches.
The first patch adds the basic support to genfusion.pl, predicates.md, rs6000.h,
and rs6000.md, along with the first fusion case (vector 'AND' fusing into
vector 'AND'). The next 43 patches each add one more fusion case. The last
patch adds the two test cases.
The multibuff.c benchmark attached to PR target/117251, which implements SHA3,
shows a slowdown when compiled for Power10 with the current trunk and GCC 14
compared to GCC 11 - GCC 13, due to excessive amounts of spilling.
The main function in multibuff.c has 3,747 lines, all of which use vector
unsigned long long. There are 696 vector rotates (all by constant amounts),
1,824 vector xor's and 600 vector andc's.
Looking at the generated code, the main thing that stands out is that the
spills and moves are caused by the support in fusion.md (generated by
genfusion.pl) that tries to fuse a vec_andc feeding into a vec_xor, and a
vec_xor feeding into another vec_xor.
On Power10, a special fusion happens when a VANDC or VXOR instruction is
adjacent to a VXOR instruction and the VANDC/VXOR feeds into the second VXOR
instruction.
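As an illustration of the source shape involved, here is a minimal sketch using
GCC's generic vector extensions. The function names andc_xor and xor_xor are
made up for this example and are not taken from multibuff.c; whether a given
compiler actually emits the fused pair for them depends on the target options
and on register allocation.

```c
/* Illustrative only: the two data-flow shapes that feed the fusion patterns.
   The names below are hypothetical and do not come from the benchmark.  */
typedef unsigned long long v2du __attribute__ ((vector_size (16)));

/* ANDC feeding XOR: the (b & ~c) result is consumed by the xor.  */
static inline v2du
andc_xor (v2du a, v2du b, v2du c)
{
  return a ^ (b & ~c);
}

/* XOR feeding XOR: the inner xor result is consumed by the outer xor.  */
static inline v2du
xor_xor (v2du a, v2du b, v2du c)
{
  return a ^ (b ^ c);
}
```

In both functions the inner operation has a single use, which is the case
where the two instructions can be placed next to each other.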
While Power10 has 64 vector registers (the logical instructions that can use
all of them have the XXL prefix), the fusion only works with the older Altivec
instruction set (which uses the V prefix). The Altivec instruction set only
has 32 vector registers (which are overlaid on VSX vector registers 32-63).
Because the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor do this
fusion, the register allocator sees more register pressure on the traditional
Altivec registers than on the VSX registers.
In addition, the vector rotates only work on the traditional Altivec
registers, which adds to the Altivec register pressure.
Finally, in addition to doing the explicit xor, andc, and rotates in Altivec
registers, we also have to load vector constants for the rotate amounts, and
these constants are allocated to Altivec registers as well.
Current trunk and GCC 12-14 have more vector spills than GCC 11, but GCC 11 has
many more vector moves than the later compilers. Thus even though it has far
fewer spills, the vector moves are why GCC 11 has the slowest results.
There is an instruction that was added in Power10 (XXEVAL) that provides
fusion between VSX vectors, including ANDC->XOR and XOR->XOR fusion.
The latency of XXEVAL is slightly more than the fused VANDC/VXOR or VXOR/VXOR,
so I have written the patch to prefer doing the Altivec instructions if they
don't need a temporary register.
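The XXEVAL immediates that appear in the tables (#45 and #105) can be
reproduced with a small sketch. It assumes that the 8-bit immediate is a truth
table over the three inputs, with the most significant bit corresponding to
inputs (0,0,0) and the least significant bit to (1,1,1), and that the operands
are ordered so that the combined operation is a ^ (b & ~c) and a ^ b ^ c. The
helper names are hypothetical and are not part of the patch.

```c
/* Sketch: derive an 8-bit ternary-logic immediate from a 1-bit function.
   Assumption: bit (7 - idx) of the immediate holds the result for the
   input combination idx = (a << 2) | (b << 1) | c.  */
typedef unsigned (*ternary_fn) (unsigned a, unsigned b, unsigned c);

/* ANDC feeding XOR, evaluated on single bits.  */
static unsigned
andc_then_xor (unsigned a, unsigned b, unsigned c)
{
  return (a ^ (b & ~c)) & 1u;
}

/* XOR feeding XOR, evaluated on single bits.  */
static unsigned
xor_then_xor (unsigned a, unsigned b, unsigned c)
{
  return (a ^ b ^ c) & 1u;
}

/* Walk all eight input combinations and set the matching immediate bit.  */
static unsigned
truth_table_imm (ternary_fn f)
{
  unsigned imm = 0;

  for (unsigned idx = 0; idx < 8; idx++)
    {
      unsigned a = (idx >> 2) & 1u;
      unsigned b = (idx >> 1) & 1u;
      unsigned c = idx & 1u;

      if (f (a, b, c))
        imm |= 1u << (7 - idx);
    }

  return imm;
}
```

Under these assumptions, truth_table_imm (andc_then_xor) gives 45 and
truth_table_imm (xor_then_xor) gives 105, matching the immediates shown in the
tables.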
Here are the results of adding XXEVAL support for the multibuff.c benchmark
attached to the PR. Note that this patch essentially recovers the speed that
was lost with GCC 14 and the current trunk:
                                   XXEVAL   Trunk   GCC15   GCC14   GCC13   GCC12
                                   ------   -----   -----   -----   -----   -----
Multibuf time in seconds            5.600   6.151   6.129   6.053   5.539   5.598
XXEVAL improvement percentage         ---   +9.8%   +9.4%   +8.1%   -1.1%      0%
Fuse VANDC -> VXOR                    209     600     600     600     600     600
Fuse VXOR -> VXOR                       0     241     241     240     120     120
XXEVAL to fuse ANDC -> XOR (#45)      391       0       0       0       0       0
XXEVAL to fuse XOR -> XOR (#105)      240       0       0       0       0       0
Spill vector to stack                 140     417     417     403     226     239
Load spilled vector from stack        490   1,012   1,012   1,000     766     782
Vector moves                            8      93     100      70      72      72
XXLANDC or VANDC                      209     600     600     600     600     600
XXLXOR or VXOR                        953   1,824   1,824   1,824   1,824   1,825
XXEVAL                                631       0       0       0       0       0
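The "XXEVAL improvement percentage" row appears to be each compiler's run time
relative to the XXEVAL run time. That reading is inferred from the numbers in
the table rather than stated anywhere, and the helper name below is made up
for this sketch.

```c
/* Sketch: how much slower a compiler's run is than the XXEVAL run, in
   percent.  A negative value means that compiler's run was faster.  */
static double
improvement_pct (double other_seconds, double xxeval_seconds)
{
  return (other_seconds / xxeval_seconds - 1.0) * 100.0;
}
```

For example, improvement_pct (6.151, 5.600) is about 9.8 and
improvement_pct (5.539, 5.600) is about -1.1, which agrees with the Trunk and
GCC13 columns above.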
Here are the results for adding support for XXEVAL for the singlebuff.c
benchmark attached to the PR. Note that adding XXEVAL greatly speeds up this
particular benchmark:
                                   XXEVAL   Trunk   GCC15   GCC14   GCC13   GCC12
                                   ------   -----   -----   -----   -----   -----
Singlebuf time in seconds           4.429   5.330   5.333   5.315   5.270   5.278
XXEVAL improvement percentage         ---  +20.3%  +20.4%  +20.0%  +19.0%  +19.2%
Fuse VANDC -> VXOR                    210     600     600     600     600     600
Fuse VXOR -> VXOR                       0     240     240     240     120     120
XXEVAL to fuse ANDC -> XOR (#45)      390       0       0       0       0       0
XXEVAL to fuse XOR -> XOR (#105)      240       0       0       0       0       0
Spill vector to stack                 134     388     388     388     391     391
Load spilled vector from stack        357     808     808     808     769     769
Vector moves                           34      80      80      80     119     119
XXLANDC or VANDC                      210     600     600     600     600     600
XXLXOR or VXOR                        954   1,824   1,824   1,824   1,824   1,824
XXEVAL                                630       0       0       0       0       0
These patches add the following fusion patterns:
    xxland  => xxland     xxlandc => xxland     xxlxor  => xxland
    xxlor   => xxland     xxlnor  => xxland     xxleqv  => xxland
    xxlorc  => xxland     xxlandc => xxlandc    xxlnand => xxland
    xxlnand => xxlnor     xxland  => xxlxor     xxland  => xxlor
    xxlandc => xxlxor     xxlandc => xxlor      xxlorc  => xxlnor
    xxlorc  => xxleqv     xxlorc  => xxlorc     xxleqv  => xxlnor
    xxlxor  => xxlxor     xxlxor  => xxlor      xxlnor  => xxlnor
    xxlor   => xxlxor     xxlor   => xxlor      xxlor   => xxlnor
    xxlnor  => xxlxor     xxlnor  => xxlor      xxlxor  => xxlnor
    xxleqv  => xxlxor     xxleqv  => xxlor      xxlorc  => xxlxor
    xxlorc  => xxlor      xxlandc => xxlnor     xxlandc => xxleqv
    xxland  => xxlnor     xxlnand => xxlxor     xxlnand => xxlor
    xxlnand => xxlnand    xxlorc  => xxlnand    xxleqv  => xxlnand
    xxlnor  => xxlnand    xxlor   => xxlnand    xxlxor  => xxlnand
    xxlandc => xxlnand    xxland  => xxlnand
--
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: [email protected]