https://gcc.gnu.org/g:509e34b08bd2cfa42caf0b8dcc79d23247516648
commit 509e34b08bd2cfa42caf0b8dcc79d23247516648
Author: Michael Meissner <meiss...@linux.ibm.com>
Date:   Thu Jan 2 18:12:33 2025 -0500

    Update ChangeLog.*

Diff:
---
 gcc/ChangeLog.sha | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)

diff --git a/gcc/ChangeLog.sha b/gcc/ChangeLog.sha
index 3ca691489d87..ece851e7aa1f 100644
--- a/gcc/ChangeLog.sha
+++ b/gcc/ChangeLog.sha
@@ -1,5 +1,151 @@
+==================== Branch work190-sha, patch #400 ====================
+
+PR target/117251: Add PowerPC XXEVAL support to speed up SHA3 calculations
+
+The multibuff.c benchmark attached to PR target/117251, which implements SHA3,
+has a slowdown when compiled for Power10 with the current trunk and GCC 14
+compared to GCC 11 - GCC 13, due to excessive amounts of spilling.
+
+The main function for the multibuff.c file has 3,747 lines, all of which are
+using vector unsigned long long.  There are 696 vector rotates (all rotates are
+constant), 1,824 vector xor's and 600 vector andc's.
+
+In looking at it, the main thing that stands out is that the reason for either
+spilling or moving variables is the support in fusion.md (generated by
+genfusion.pl) that tries to fuse the vec_andc feeding into vec_xor, and other
+vec_xor's feeding into vec_xor.
+
+On PowerPC for Power10, there is a special fusion mode that happens if a
+VANDC or VXOR instruction is adjacent to a VXOR instruction
+and the VANDC/VXOR feeds into the second VXOR instruction.
+
+While the Power10 has 64 vector registers (which use the XXL prefix to do
+logical operations), the fusion only works with the older Altivec instruction
+set (which uses the V prefix).  The Altivec instruction set only has 32 vector
+registers (which are overlaid over the VSX vector registers 32-63).
+
+Because the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor do this
+fusion, the register allocator has more register pressure for the
+traditional Altivec registers instead of the VSX registers.
+
+In addition, the vector rotates only work on the
+traditional Altivec registers, which adds to the Altivec register pressure.
+
+Finally, in addition to doing the explicit xor, andc, and rotates using the
+Altivec registers, we also have to load vector constants for the rotate amount,
+and these constants are also allocated to Altivec registers.
+
+Current trunk and GCC 12-14 have more vector spills than GCC 11, but GCC 11 has
+many more vector moves than the later compilers.  Thus even though it has far
+fewer spills, the vector moves are why GCC 11 has the slowest results.
+
+There is an instruction that was added in Power10 (XXEVAL) that provides
+fusion between VSX vectors, including ANDC->XOR and XOR->XOR fusion.
+
+The latency of XXEVAL is slightly more than the fused VANDC/VXOR or VXOR/VXOR,
+so I have written the patch to prefer doing the Altivec instructions if they
+don't need a temporary register.
+
+Here are the results for adding support for XXEVAL for the multibuff.c
+benchmark attached to the PR.  Note that this patch essentially recovers the
+speed that was lost with GCC 14 and the current trunk:
+
+                                XXEVAL   Trunk   GCC14   GCC13   GCC12   GCC11
+                                ------   -----   -----   -----   -----   -----
+Benchmark time in seconds         5.53    6.15    6.26    5.57    5.61    9.56
+
+Fuse VANDC -> VXOR                 209     600     600     600     600     600
+Fuse VXOR -> VXOR                    0     240     240     120     120     120
+XXEVAL to fuse ANDC -> XOR         391       0       0       0       0       0
+XXEVAL to fuse XOR -> XOR          240       0       0       0       0       0
+
+Spill vector to stack               78     364     364     172     184     110
+Load spilled vector from stack     431     962     962     713     723     166
+Vector moves                        10     100     100      70      72   3,055
+
+Vector rotate right                696     696     696     696     696     696
+XXLANDC or VANDC                   209     600     600     600     600     600
+XXLXOR or VXOR                     953   1,824   1,824   1,824   1,824   1,825
+XXEVAL                             631       0       0       0       0       0
+
+Load vector rotate constants        24      24      24      24      24      24
+
+
+Here are the results for adding support for XXEVAL for the singlebuff.c
+benchmark attached to the PR.  Note that adding XXEVAL greatly speeds up this
+particular benchmark:
+
+                                XXEVAL   Trunk   GCC14   GCC13   GCC12   GCC11
+                                ------   -----   -----   -----   -----   -----
+Benchmark time in seconds         4.46    5.40    5.40    5.35    5.36    7.54
+
+Fuse VANDC -> VXOR                 210     600     600     600     600     600
+Fuse VXOR -> VXOR                    0     240     240     120     120     120
+XXEVAL to fuse ANDC -> XOR         390       0       0       0       0       0
+XXEVAL to fuse XOR -> XOR          240       0       0       0       0       0
+
+Spill vector to stack              113     379     379     382     382      63
+Load spilled vector from stack     333     796     796     757     757      68
+Vector moves                        34      80      80     119     119   2,409
+
+Vector rotate right                696     696     696     696     696     696
+XXLANDC or VANDC                   210     600     600     600     600     600
+XXLXOR or VXOR                     954   1,824   1,824   1,824   1,824   1,824
+XXEVAL                             630       0       0       0       0       0
+
+Load vector rotate constants        96      96      96      96      96      96
+
+
+These patches add the following fusion patterns:
+
+    xxland   => xxland      xxlandc  => xxland      xxlxor   => xxland
+    xxlor    => xxland      xxlnor   => xxland      xxleqv   => xxland
+    xxlorc   => xxland      xxlandc  => xxlandc     xxlnand  => xxland
+    xxlnand  => xxlnor      xxland   => xxlxor      xxland   => xxlor
+    xxlandc  => xxlxor      xxlandc  => xxlor       xxlorc   => xxlnor
+    xxlorc   => xxleqv      xxlorc   => xxlorc      xxleqv   => xxlnor
+    xxlxor   => xxlxor      xxlxor   => xxlor       xxlnor   => xxlnor
+    xxlor    => xxlxor      xxlor    => xxlor       xxlor    => xxlnor
+    xxlnor   => xxlxor      xxlnor   => xxlor       xxlxor   => xxlnor
+    xxleqv   => xxlxor      xxleqv   => xxlor       xxlorc   => xxlxor
+    xxlorc   => xxlor       xxlandc  => xxlnor      xxlandc  => xxleqv
+    xxland   => xxlnor      xxlnand  => xxlxor      xxlnand  => xxlor
+    xxlnand  => xxlnand     xxlorc   => xxlnand     xxleqv   => xxlnand
+    xxlnor   => xxlnand     xxlor    => xxlnand     xxlxor   => xxlnand
+    xxlandc  => xxlnand     xxland   => xxlnand
+
+
+2025-01-02  Michael Meissner  <meiss...@linux.ibm.com>
+
+gcc/
+
+	PR target/117251
+	* config/rs6000/fusion.md: Regenerate.
+	* config/rs6000/genfusion.pl (gen_logical_addsubf): Add support to
+	generate vector/vector logical fusion if XXEVAL supports the fusion.
+	* config/rs6000/predicates.md (vector_fusion_operand): New predicate.
+	* config/rs6000/rs6000.cc (rs6000_opt_vars): Add -mxxeval.
+	* config/rs6000/rs6000.md (isa attribute): Add xxeval.
+	(enabled attribute): Add support for -mxxeval.
+	* config/rs6000/rs6000.opt (-mxxeval): New switch.
+
+gcc/testsuite/
+
+	PR target/117251
+	* gcc.target/powerpc/p10-vector-fused-1.c: New test.
+	* gcc.target/powerpc/p10-vector-fused-2.c: Likewise.
+
 ==================== Branch work190-sha, baseline ====================
 
+Add ChangeLog.sha and update REVISION.
+
+2025-01-02  Michael Meissner  <meiss...@linux.ibm.com>
+
+gcc/
+
+	* ChangeLog.sha: New file for branch.
+	* REVISION: Update.
+
 2025-01-02  Michael Meissner  <meiss...@linux.ibm.com>
 
 	Clone branch
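
For readers unfamiliar with the two dataflow shapes named in the patch description (ANDC feeding XOR, and XOR feeding XOR), here is a minimal scalar C sketch of them. It is not taken from multibuff.c or from the patch: the names andc_then_xor, xor_then_xor and chi_row are hypothetical, and 64-bit scalars stand in for vector unsigned long long purely to keep the example portable. The ANDC->XOR shape is the one that appears in the chi step of SHA3 (Keccak).

```c
#include <stdint.h>

/* ANDC feeding XOR: computes a ^ (b & ~c).
   Hypothetical helper; in vector code this is the vec_andc -> vec_xor
   pair that the description says is fused as VANDC/VXOR or done by XXEVAL. */
static inline uint64_t andc_then_xor(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t t = b & ~c;   /* first operation: and-with-complement */
    return a ^ t;          /* second operation consumes the first's result */
}

/* XOR feeding XOR: computes a ^ (b ^ c). Hypothetical helper. */
static inline uint64_t xor_then_xor(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t t = b ^ c;    /* first xor */
    return a ^ t;          /* second xor consumes the first's result */
}

/* Keccak chi step on one row of five lanes:
   out[i] = in[i] ^ (~in[i+1] & in[i+2]), indices taken modulo 5. */
static void chi_row(const uint64_t in[5], uint64_t out[5])
{
    for (int i = 0; i < 5; i++)
        out[i] = andc_then_xor(in[i], in[(i + 2) % 5], in[(i + 1) % 5]);
}
```

Each helper is a pair of dependent logical operations. Per the description above, such a pair can either be fused as adjacent VANDC/VXOR or VXOR/VXOR instructions in Altivec registers, or be emitted as a single XXEVAL that can use any VSX register.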