https://gcc.gnu.org/g:a8ecc1a3ff1faece3363d50d7c501123c2be6a5b
commit a8ecc1a3ff1faece3363d50d7c501123c2be6a5b
Author: Michael Meissner <meiss...@linux.ibm.com>
Date:   Thu Oct 24 12:26:28 2024 -0400

    Update ChangeLog.*

Diff:
---
 gcc/ChangeLog.sha | 143 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 126 insertions(+), 17 deletions(-)

diff --git a/gcc/ChangeLog.sha b/gcc/ChangeLog.sha
index fe43d0cb19a8..de75ac6f0e81 100644
--- a/gcc/ChangeLog.sha
+++ b/gcc/ChangeLog.sha
@@ -1,18 +1,8 @@
-==================== Branch work182-sha, patch #402 ====================
-
-Add missing test.
-
-2024-10-16 Michael Meissner <meiss...@linux.ibm.com>
-
-gcc/testsuite/
-
-	* gcc.target/powerpc/vector-rotate-left.c: New test.
-
-==================== Branch work182-sha, patch #401 ====================
+==================== Branch work182-sha, patch #411 was reverted ====================
 
 Add potential p-future XVRLD and XVRLDI instructions.
 
-2024-10-16 Michael Meissner <meiss...@linux.ibm.com>
+2024-10-24 Michael Meissner <meiss...@linux.ibm.com>
 
 gcc/
 
@@ -24,11 +14,128 @@ gcc/
 	* config/rs6000/rs6000.md (isa attribute): Add xvrlw.
 	(enabled attribute): Add support for xvrlw.
 
-==================== Branch work182-sha, patch #400 ====================
+gcc/testsuite/
+
+	* gcc.target/powerpc/vector-rotate-left.c: New test.
+
+==================== Branch work182-sha, patch #410 was reverted ====================
+
+PR target/117251: Add PowerPC XXEVAL support to speed up SHA3 calculations
+
+The multibuff.c benchmark attached to the PR target/117251 compiled for Power10
+PowerPC that implement SHA3 has a slowdown in the current trunk and GCC 14
+compared to GCC 11 - GCC 13, due to excessive amounts of spilling.
+
+The main function for the multibuf.c file has 3,747 lines, all of which are
+using vector unsigned long long. There are 696 vector rotates (all rotates are
+constant), 1,824 vector xor's and 600 vector andc's.
+
+In looking at it, the main thing that steps out is the reason for either
+spilling or moving variables is the support in fusion.md (generated by
+genfusion.pl) that tries to fuse the vec_andc feeding into vec_xor, and other
+vec_xor's feeding into vec_xor.
+
+On the powerpc for power10, there is a special fusion mode that happens if the
+machine has a VANDC or VXOR instruction that is adjacent to a VXOR instruction
+and the VANDC/VXOR feeds into the 2nd VXOR instruction.
+
+While the Power10 has 64 vector registers (which uses the XXL prefix to do
+logical operations), the fusion only works with the older Altivec instruction
+set (which uses the V prefix). The Altivec instruction only has 32 vector
+registers (which are overlaid over the VSX vector registers 32-63).
+
+By having the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor to do this
+fusion, it means that the register allocator has more register pressure for the
+traditional Altivec registers instead of the VSX registers.
+
+In addition, since there are vector rotates, these rotates only work on the
+traditional Altivec registers, which adds to the Altivec register pressure.
+
+Finally in addition to doing the explicit xor, andc, and rotates using the
+Altivec registers, we have to also load vector constants for the rotate amount
+and these registers also are allocated as Altivec registers.
 
-Initial support for adding xxeval fusion support.
+Current trunk and GCC 12-14 have more vector spills than GCC 11, but GCC 11 has
+many more vector moves that the later compilers. Thus even though it has way
+less spills, the vector moves are why GCC 11 have the slowest results.
 
-2024-10-16 Michael Meissner <meiss...@linux.ibm.com>
+There is an instruction that was added in power10 (XXEVAL) that does provide
+fusion between VSX vectors that includes ANDC->XOR and XOR->XOR fusion.
+
+The latency of XXEVAL is slightly more than the fused VANDC/VXOR or VXOR/VXOR,
+so I have written the patch to prefer doing the Altivec instructions if they
+don't need a temporary register.
+
+Here are the results for adding support for XXEVAL for the multibuff.c
+benchmark attached to the PR. Note that we essentially recover the speed with
+this patch that were lost with GCC 14 and the current trunk:
+
+                                XXEVAL   Trunk   GCC14   GCC13   GCC12   GCC11
+                                ------   -----   -----   -----   -----   -----
+Benchmark time in seconds         5.53    6.15    6.26    5.57    5.61    9.56
+
+Fuse VANDC -> VXOR                 209     600     600     600     600     600
+Fuse VXOR -> VXOR                    0     240     240     120     120     120
+XXEVAL to fuse ANDC -> XOR         391       0       0       0       0       0
+XXEVAL to fuse XOR -> XOR          240       0       0       0       0       0
+
+Spill vector to stack               78     364     364     172     184     110
+Load spilled vector from stack     431     962     962     713     723     166
+Vector moves                        10     100     100      70      72   3,055
+
+Vector rotate right                696     696     696     696     696     696
+XXLANDC or VANDC                   209     600     600     600     600     600
+XXLXOR or VXOR                     953   1,824   1,824   1,824   1,824   1,825
+XXEVAL                             631       0       0       0       0       0
+
+Load vector rotate constants        24      24      24      24      24      24
+
+
+Here are the results for adding support for XXEVAL for the singlebuff.c
+benchmark attached to the PR. Note that adding XXEVAL greatly speeds up this
+particular benchmark:
+
+                                XXEVAL   Trunk   GCC14   GCC13   GCC12   GCC11
+                                ------   -----   -----   -----   -----   -----
+Benchmark time in seconds         4.46    5.40    5.40    5.35    5.36    7.54
+
+Fuse VANDC -> VXOR                 210     600     600     600     600     600
+Fuse VXOR -> VXOR                    0     240     240     120     120     120
+XXEVAL to fuse ANDC -> XOR         390       0       0       0       0       0
+XXEVAL to fuse XOR -> XOR          240       0       0       0       0       0
+
+Spill vector to stack              113     379     379     382     382      63
+Load spilled vector from stack     333     796     796     757     757      68
+Vector moves                        34      80      80     119     119   2,409
+
+Vector rotate right                696     696     696     696     696     696
+XXLANDC or VANDC                   210     600     600     600     600     600
+XXLXOR or VXOR                     954   1,824   1,824   1,824   1,824   1,824
+XXEVAL                             630       0       0       0       0       0
+
+Load vector rotate constants        96      96      96      96      96      96
+
+
+These patches to add XXEVAL support add the following fusion patterns:
+
+    xxland => xxland      xxlandc => xxland     xxlxor => xxland
+    xxlor => xxland       xxlnor => xxland      xxleqv => xxland
+    xxlorc => xxland      xxlandc => xxlandc    xxlnand => xxland
+    xxlnand => xxlnor     xxland => xxlxor      xxland => xxlor
+    xxlandc => xxlxor     xxlandc => xxlor      xxlorc => xxlnor
+    xxlorc => xxleqv      xxlorc => xxlorc      xxleqv => xxlnor
+    xxlxor => xxlxor      xxlxor => xxlor       xxlnor => xxlnor
+    xxlor => xxlxor       xxlor => xxlor        xxlor => xxlnor
+    xxlnor => xxlxor      xxlnor => xxlor       xxlxor => xxlnor
+    xxleqv => xxlxor      xxleqv => xxlor       xxlorc => xxlxor
+    xxlorc => xxlor       xxlandc => xxlnor     xxlandc => xxleqv
+    xxland => xxlnor      xxlnand => xxlxor     xxlnand => xxlor
+    xxlnand => xxlnand    xxlorc => xxlnand     xxleqv => xxlnand
+    xxlnor => xxlnand     xxlor => xxlnand      xxlxor => xxlnand
+    xxlandc => xxlnand    xxland => xxlnand
+
+
+2024-10-24 Michael Meissner <meiss...@linux.ibm.com>
 
 gcc/
 
@@ -47,8 +154,10 @@ gcc/testsuite/
 	PR target/117251
 	* gcc.target/powerpc/p10-vector-fused-1.c: New test.
 	* gcc.target/powerpc/p10-vector-fused-2.c: Likewise.
-	* gcc.target/powerpc/xxeval-1.c: Likewise.
-	* gcc.target/powerpc/xxeval-2.c: Likewise.
+
+==================== Branch work182-sha, patch #402 was reverted ====================
+==================== Branch work182-sha, patch #401 was reverted ====================
+==================== Branch work182-sha, patch #400 was reverted ====================
 
 ==================== Branch work182-sha, baseline ====================
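
For readers unfamiliar with the ANDC -> XOR and XOR -> XOR fusion described in the patch #410 entry, the following is a minimal scalar sketch in C, not code from the patch. The chi() function shows the dependent ANDC/XOR pair that a SHA3-style kernel produces. The helper ternary_imm() and the two bit functions are hypothetical names introduced here. The helper builds an 8-bit truth table of the kind a three-input logic instruction such as XXEVAL takes as an immediate, assuming that the entry for inputs (a, b, c) is selected by index 4*a + 2*b + c and that entry 0 sits in the most significant bit. The operand order GCC actually uses, and therefore the immediates it emits, are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One SHA3-style "chi" step on a scalar 64-bit lane: x ^ (~y & z).
   The AND-with-complement feeds directly into the XOR, which is the
   ANDC -> XOR shape the ChangeLog entry talks about.  */
static uint64_t chi(uint64_t x, uint64_t y, uint64_t z)
{
    uint64_t t = ~y & z;   /* ANDC */
    return x ^ t;          /* XOR consuming the ANDC result */
}

/* A three-input bit function; each argument is 0 or 1.  */
typedef unsigned (*bitfn)(unsigned a, unsigned b, unsigned c);

/* Hypothetical helper: build an 8-bit truth table for f.
   ASSUMPTION: the table index is 4*a + 2*b + c and entry 0 is placed
   in the most significant bit of the result.  */
static unsigned ternary_imm(bitfn f)
{
    unsigned imm = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        unsigned a = (idx >> 2) & 1;
        unsigned b = (idx >> 1) & 1;
        unsigned c = idx & 1;
        imm = (imm << 1) | (f(a, b, c) & 1);
    }
    return imm;
}

/* a ^ (~b & c): the single-bit form of chi().  */
static unsigned f_andc_xor(unsigned a, unsigned b, unsigned c)
{
    return (a ^ (~b & c)) & 1;
}

/* a ^ b ^ c: two chained XORs.  */
static unsigned f_xor_xor(unsigned a, unsigned b, unsigned c)
{
    return (a ^ b ^ c) & 1;
}

/* Small self-check: the fused form equals the two-step form.  */
static void self_check(void)
{
    uint64_t x = 0xF0F0, y = 0xFF00, z = 0x0FF0;
    uint64_t t = ~y & z;
    assert(chi(x, y, z) == (x ^ t));
}
```

Under the stated bit-order assumption, ternary_imm(f_andc_xor) and ternary_imm(f_xor_xor) each give one immediate that replaces a two-instruction sequence. Because the combined operation is not restricted to the 32 Altivec registers, this is the property the entry relies on to reduce register pressure.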