History: This is version 2 of the patch. In the original patch, all 44
fusion opportunities were lumped together in one patch. Outside of
fusion.md, the changes are fairly small: each fusion pattern gains one
alternative to add xxeval support. fusion.md is a generated file
(created by genfusion.pl) that contains all of the fusion combinations.
Because of these automated changes, fusion.md had 265 lines deleted and
397 lines added.

In version 2 of the patch, I broke the original patch into 45 separate
patches. The first patch adds the basic support to genfusion.pl,
predicates.md, rs6000.h, and rs6000.md, along with the first fusion
case (vector 'AND' fusing into vector 'AND'). The next 43 patches each
add one more fusion case. The last patch adds the two test cases.

The multibuff.c benchmark attached to PR target/117251 implements SHA3.
When compiled for Power10, it is slower with the current trunk and GCC
14 than with GCC 11 through GCC 13, due to excessive spilling.

The main function in multibuff.c has 3,747 lines, all of which use
vector unsigned long long. There are 696 vector rotates (all by a
constant amount), 1,824 vector xor's and 600 vector andc's.

Looking at the generated code, the main thing that stands out is that
the spilling and the extra register moves are caused by the support in
fusion.md (generated by genfusion.pl) that tries to fuse a vec_andc
feeding into a vec_xor, and a vec_xor feeding into another vec_xor.

On Power10, a special fusion mode applies when a VANDC or VXOR
instruction is adjacent to a VXOR instruction and the result of the
VANDC/VXOR feeds into the second VXOR instruction.
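
As a rough illustration of the two source-level shapes involved, here
is a minimal sketch. It is not taken from the benchmark: the function
names are hypothetical, and it uses GCC's generic vector extension
instead of <altivec.h> so that it builds on any target.

```c
/* Two 64-bit lanes, the same layout as "vector unsigned long long".
   The generic vector_size attribute is used here only for portability;
   the benchmark itself uses the PowerPC vector types.  */
typedef unsigned long long v2di __attribute__ ((vector_size (16)));

/* ANDC feeding XOR: the result of (b & ~c) is consumed by the xor.  */
static v2di
andc_xor (v2di a, v2di b, v2di c)
{
  return a ^ (b & ~c);
}

/* XOR feeding XOR: the result of (b ^ c) is consumed by the xor.  */
static v2di
xor_xor (v2di a, v2di b, v2di c)
{
  return a ^ (b ^ c);
}
```

Compiling functions like these with -O2 -mcpu=power10 and reading the
assembly is one way to see which instructions are chosen. The exact
output depends on the compiler version and on register allocation.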

While Power10 has 64 vector registers (accessed by the logical
instructions that use the XXL prefix), the fusion only works with the
older Altivec instruction set (which uses the V prefix). The Altivec
instructions can only use 32 vector registers (which are overlaid on
VSX vector registers 32-63).

Because the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor
perform this fusion, the register allocator puts more pressure on the
traditional Altivec registers instead of spreading it across the VSX
registers.

In addition, the vector rotates only work on the traditional Altivec
registers, which adds to the Altivec register pressure.

Finally, in addition to doing the explicit xor, andc, and rotates in
the Altivec registers, we also have to load vector constants for the
rotate amounts, and these constants are allocated to Altivec registers
as well.

Current trunk and GCC 12-14 have more vector spills than GCC 11, but
GCC 11 has many more vector moves than the later compilers. Thus, even
though it has far fewer spills, the vector moves are why GCC 11 has the
slowest results.

Power10 added an instruction (XXEVAL) that does provide this kind of
fusion for VSX vectors, including ANDC->XOR and XOR->XOR fusion.

The latency of XXEVAL is slightly higher than that of the fused
VANDC/VXOR or VXOR/VXOR, so I have written the patch to prefer the
Altivec instructions when they don't need a temporary register.
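
The results tables label the two XXEVAL forms with the immediates #45
and #105. The sketch below shows one way such an immediate can be
derived from a truth table. It assumes that the table entry for inputs
(a, b, c) sits at big-endian bit position 4*a + 2*b + c of the 8-bit
immediate, and that the first operand is the one being xor'ed. Both are
assumptions of this sketch, and the helper names are hypothetical.

```c
/* A bitwise function of three one-bit inputs (hypothetical helper type).  */
typedef unsigned (*ternary_fn) (unsigned a, unsigned b, unsigned c);

/* Build the 8-bit truth table for FN.  Assumption: the entry for inputs
   (a, b, c) is stored at bit 7 - (4*a + 2*b + c), counting from the
   least significant bit.  */
static unsigned
xxeval_imm (ternary_fn fn)
{
  unsigned imm = 0;
  for (unsigned idx = 0; idx < 8; idx++)
    {
      unsigned a = (idx >> 2) & 1;
      unsigned b = (idx >> 1) & 1;
      unsigned c = idx & 1;
      if (fn (a, b, c) & 1)
        imm |= 1u << (7 - idx);
    }
  return imm;
}

/* ANDC feeding XOR: a ^ (b & ~c).  */
static unsigned
andc_xor (unsigned a, unsigned b, unsigned c)
{
  return a ^ (b & ~c);
}

/* XOR feeding XOR: a ^ b ^ c.  */
static unsigned
xor_xor (unsigned a, unsigned b, unsigned c)
{
  return a ^ b ^ c;
}
```

Under those assumptions, xxeval_imm (andc_xor) evaluates to 45 and
xxeval_imm (xor_xor) evaluates to 105, which matches the two immediates
shown in the tables.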

Here are the results for adding support for XXEVAL for the multibuff.c
benchmark attached to the PR. Note that this patch essentially recovers
the speed that was lost with GCC 14 and the current trunk:

                                   XXEVAL   Trunk   GCC15   GCC14   GCC13
                                   ------   -----   -----   -----   -----
Multibuf time in seconds            5.600   6.151   6.129   6.053   5.539
XXEVAL improvement percentage         ---   +9.8%   +9.4%   +8.1%   -1.1%
Fuse VANDC -> VXOR                    209     600     600     600     600
Fuse VXOR -> VXOR                       0     241     241     240     120
XXEVAL to fuse ANDC -> XOR (#45)      391       0       0       0       0
XXEVAL to fuse XOR -> XOR (#105)      240       0       0       0       0
Spill vector to stack                 140     417     417     403     226
Load spilled vector from stack        490   1,012   1,012   1,000     766
Vector moves                            8      93     100      70      72
XXLANDC or VANDC                      209     600     600     600     600
XXLXOR or VXOR                        953   1,824   1,824   1,824   1,824
XXEVAL                                631       0       0       0       0
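
The "XXEVAL improvement percentage" row appears to be the other
compiler's time relative to the XXEVAL time, so a positive value means
the XXEVAL build is faster. A minimal sketch of that arithmetic follows.
The function name is hypothetical, and the formula is inferred from the
numbers in the table, not taken from the patch.

```c
/* Percentage by which OTHER_SECONDS exceeds XXEVAL_SECONDS.
   Assumption: this is how the percentage row was computed; it
   reproduces the table values after rounding to one decimal place.  */
static double
improvement_pct (double other_seconds, double xxeval_seconds)
{
  return (other_seconds / xxeval_seconds - 1.0) * 100.0;
}
```

For example, improvement_pct (6.151, 5.600) is about 9.8, and
improvement_pct (5.539, 5.600) is about -1.1.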

Here are the results for adding support for XXEVAL for the singlebuff.c
benchmark attached to the PR. Note that adding XXEVAL greatly speeds
up this particular benchmark:

                                   XXEVAL   Trunk   GCC15   GCC14   GCC13
                                   ------   -----   -----   -----   -----
Singlebuf time in seconds           4.429   5.330   5.333   5.315   5.270
XXEVAL improvement percentage         ---  +20.3%  +20.4%  +20.0%  +19.0%
Fuse VANDC -> VXOR                    210     600     600     600     600
Fuse VXOR -> VXOR                       0     240     240     240     120
XXEVAL to fuse ANDC -> XOR (#45)      390       0       0       0       0
XXEVAL to fuse XOR -> XOR (#105)      240       0       0       0       0
Spill vector to stack                 134     388     388     388     391
Load spilled vector from stack        357     808     808     808     769
Vector moves                           34      80      80      80     119
XXLANDC or VANDC                      210     600     600     600     600
XXLXOR or VXOR                        954   1,824   1,824   1,824   1,824
XXEVAL                                630       0       0       0       0

These patches add the following fusion patterns:

    xxland  => xxland       xxlandc => xxland
    xxlxor  => xxland       xxlor   => xxland
    xxlnor  => xxland       xxleqv  => xxland
    xxlorc  => xxland       xxlandc => xxlandc
    xxlnand => xxland       xxlnand => xxlnor
    xxland  => xxlxor       xxland  => xxlor
    xxlandc => xxlxor       xxlandc => xxlor
    xxlorc  => xxlnor       xxlorc  => xxleqv
    xxlorc  => xxlorc       xxleqv  => xxlnor
    xxlxor  => xxlxor       xxlxor  => xxlor
    xxlnor  => xxlnor       xxlor   => xxlxor
    xxlor   => xxlor        xxlor   => xxlnor
    xxlnor  => xxlxor       xxlnor  => xxlor
    xxlxor  => xxlnor       xxleqv  => xxlxor
    xxleqv  => xxlor        xxlorc  => xxlxor
    xxlorc  => xxlor        xxlandc => xxlnor
    xxlandc => xxleqv       xxland  => xxlnor
    xxlnand => xxlxor       xxlnand => xxlor
    xxlnand => xxlnand      xxlorc  => xxlnand
    xxleqv  => xxlnand      xxlnor  => xxlnand
    xxlor   => xxlnand      xxlxor  => xxlnand
    xxlandc => xxlnand      xxland  => xxlnand

I have committed all of the patches in my backlog (dense math
registers, other -mcpu=future instructions, random bug fixes, support
for _Float16 and __bfloat16, and optimizations for vector logical
operations on power10/power11) into the IBM vendor branch:

    vendors/ibm/gcc-17-future

--
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: [email protected]