https://gcc.gnu.org/g:a8ecc1a3ff1faece3363d50d7c501123c2be6a5b

commit a8ecc1a3ff1faece3363d50d7c501123c2be6a5b
Author: Michael Meissner <meiss...@linux.ibm.com>
Date:   Thu Oct 24 12:26:28 2024 -0400

    Update ChangeLog.*

Diff:
---
 gcc/ChangeLog.sha | 143 +++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 126 insertions(+), 17 deletions(-)

diff --git a/gcc/ChangeLog.sha b/gcc/ChangeLog.sha
index fe43d0cb19a8..de75ac6f0e81 100644
--- a/gcc/ChangeLog.sha
+++ b/gcc/ChangeLog.sha
@@ -1,18 +1,8 @@
-==================== Branch work182-sha, patch #402 ====================
-
-Add missing test.
-
-2024-10-16  Michael Meissner  <meiss...@linux.ibm.com>
-
-gcc/testsuite/
-
-       * gcc.target/powerpc/vector-rotate-left.c: New test.
-
-==================== Branch work182-sha, patch #401 ====================
+==================== Branch work182-sha, patch #411 was reverted ====================
 
 Add potential p-future XVRLD and XVRLDI instructions.
 
-2024-10-16  Michael Meissner  <meiss...@linux.ibm.com>
+2024-10-24  Michael Meissner  <meiss...@linux.ibm.com>
 
 gcc/
 
@@ -24,11 +14,128 @@ gcc/
        * config/rs6000/rs6000.md (isa attribute): Add xvrlw.
        (enabled attribute): Add support for xvrlw.
 
-==================== Branch work182-sha, patch #400 ====================
+gcc/testsuite/
+
+       * gcc.target/powerpc/vector-rotate-left.c: New test.
+
+==================== Branch work182-sha, patch #410 was reverted ====================
+
+PR target/117251: Add PowerPC XXEVAL support to speed up SHA3 calculations
+
+The multibuff.c benchmark attached to PR target/117251, which implements SHA3,
+is slower when compiled for Power10 with the current trunk and GCC 14 than with
+GCC 11 - GCC 13, due to excessive amounts of spilling.
+
+The main function in the multibuff.c file has 3,747 lines, all of which use
+vector unsigned long long.  There are 696 vector rotates (all by constant
+amounts), 1,824 vector xors, and 600 vector andcs.
+
+Looking at it, the main thing that stands out is that the spilling and moving
+of variables is caused by the support in fusion.md (generated by genfusion.pl)
+that tries to fuse a vec_andc feeding into a vec_xor, and a vec_xor feeding
+into another vec_xor.
+
+On Power10, a special fusion mode applies when a VANDC or VXOR instruction is
+adjacent to a VXOR instruction and the result of the VANDC/VXOR feeds into that
+second VXOR instruction.
+
+While Power10 has 64 vector registers (the VSX registers, which use the XXL
+prefix for logical operations), the fusion only works with the older Altivec
+instruction set (which uses the V prefix).  The Altivec instruction set has
+only 32 vector registers (which are overlaid on VSX vector registers 32-63).
+
+Because the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor do this
+fusion, the register allocator has more register pressure on the traditional
+Altivec registers rather than on the VSX registers.
+
+In addition, since there are vector rotates, these rotates only work on the
+traditional Altivec registers, which adds to the Altivec register pressure.
+
+Finally, in addition to doing the explicit xor, andc, and rotates using the
+Altivec registers, we also have to load vector constants for the rotate
+amounts, and these constants are also allocated to Altivec registers.
 
-Initial support for adding xxeval fusion support.
+Current trunk and GCC 12-14 have more vector spills than GCC 11, but GCC 11 has
+many more vector moves than the later compilers.  Thus even though it has far
+fewer spills, the vector moves are why GCC 11 has the slowest results.
 
-2024-10-16  Michael Meissner  <meiss...@linux.ibm.com>
+There is an instruction that was added in power10 (XXEVAL) that does provide
+fusion between VSX vectors that includes ANDC->XOR and XOR->XOR fusion.
+
+The latency of XXEVAL is slightly more than the fused VANDC/VXOR or VXOR/VXOR,
+so I have written the patch to prefer doing the Altivec instructions if they
+don't need a temporary register.
+
+Here are the results for adding support for XXEVAL for the multibuff.c
+benchmark attached to the PR.  Note that this patch essentially recovers the
+speed that was lost with GCC 14 and the current trunk:
+
+                              XXEVAL    Trunk   GCC14   GCC13   GCC12    GCC11
+                              ------    -----   -----   -----   -----    -----
+Benchmark time in seconds       5.53     6.15    6.26    5.57    5.61     9.56
+
+Fuse VANDC -> VXOR               209      600     600     600     600      600
+Fuse VXOR -> VXOR                  0      240     240     120     120      120
+XXEVAL to fuse ANDC -> XOR       391        0       0       0       0        0
+XXEVAL to fuse XOR -> XOR        240        0       0       0       0        0
+
+Spill vector to stack             78      364     364     172     184      110
+Load spilled vector from stack   431      962     962     713     723      166
+Vector moves                      10      100     100      70      72    3,055
+
+Vector rotate right              696      696     696     696     696      696
+XXLANDC or VANDC                 209      600     600     600     600      600
+XXLXOR or VXOR                   953    1,824   1,824   1,824   1,824    1,825
+XXEVAL                           631        0       0       0       0        0
+
+Load vector rotate constants      24       24      24      24      24       24
+
+
+Here are the results for adding support for XXEVAL for the singlebuff.c
+benchmark attached to the PR.  Note that adding XXEVAL greatly speeds up this
+particular benchmark:
+
+                              XXEVAL    Trunk   GCC14   GCC13   GCC12    GCC11
+                              ------    -----   -----   -----   -----    -----
+Benchmark time in seconds       4.46     5.40    5.40    5.35    5.36     7.54
+
+Fuse VANDC -> VXOR               210      600     600     600     600      600
+Fuse VXOR -> VXOR                  0      240     240     120     120      120
+XXEVAL to fuse ANDC -> XOR       390        0       0       0       0        0
+XXEVAL to fuse XOR -> XOR        240        0       0       0       0        0
+
+Spill vector to stack            113      379     379     382     382       63
+Load spilled vector from stack   333      796     796     757     757       68
+Vector moves                      34       80      80     119     119    2,409
+
+Vector rotate right              696      696     696     696     696      696
+XXLANDC or VANDC                 210      600     600     600     600      600
+XXLXOR or VXOR                   954    1,824   1,824   1,824   1,824    1,824
+XXEVAL                           630        0       0       0       0        0
+
+Load vector rotate constants      96       96      96      96      96       96
+
+
+These patches to add XXEVAL support add the following fusion patterns:
+
+       xxland  => xxland       xxlandc => xxland       xxlxor  => xxland
+       xxlor   => xxland       xxlnor  => xxland       xxleqv  => xxland
+       xxlorc  => xxland       xxlandc => xxlandc      xxlnand => xxland
+       xxlnand => xxlnor       xxland  => xxlxor       xxland  => xxlor
+       xxlandc => xxlxor       xxlandc => xxlor        xxlorc  => xxlnor
+       xxlorc  => xxleqv       xxlorc  => xxlorc       xxleqv  => xxlnor
+       xxlxor  => xxlxor       xxlxor  => xxlor        xxlnor  => xxlnor
+       xxlor   => xxlxor       xxlor   => xxlor        xxlor   => xxlnor
+       xxlnor  => xxlxor       xxlnor  => xxlor        xxlxor  => xxlnor
+       xxleqv  => xxlxor       xxleqv  => xxlor        xxlorc  => xxlxor
+       xxlorc  => xxlor        xxlandc => xxlnor       xxlandc => xxleqv
+       xxland  => xxlnor       xxlnand => xxlxor       xxlnand => xxlor
+       xxlnand => xxlnand      xxlorc  => xxlnand      xxleqv  => xxlnand
+       xxlnor  => xxlnand      xxlor   => xxlnand      xxlxor  => xxlnand
+       xxlandc => xxlnand      xxland  => xxlnand
+
+
+2024-10-24  Michael Meissner  <meiss...@linux.ibm.com>
 
 gcc/
 
@@ -47,8 +154,10 @@ gcc/testsuite/
        PR target/117251
        * gcc.target/powerpc/p10-vector-fused-1.c: New test.
        * gcc.target/powerpc/p10-vector-fused-2.c: Likewise.
-       * gcc.target/powerpc/xxeval-1.c: Likewise.
-       * gcc.target/powerpc/xxeval-2.c: Likewise.
+
+==================== Branch work182-sha, patch #402 was reverted ====================
+==================== Branch work182-sha, patch #401 was reverted ====================
+==================== Branch work182-sha, patch #400 was reverted ====================
 
 ==================== Branch work182-sha, baseline ====================

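Note on the ChangeLog text above: the reason one XXEVAL can stand in for an
ANDC->XOR or XOR->XOR pair can be sketched in C.  The sketch assumes XXEVAL
selects a three-input bitwise function through an 8-bit truth-table immediate.
The helper names (eval3, make_table, andc_xor, xor_xor) are hypothetical and
not taken from the patch.  The bit ordering of the table is also an assumption
and should be checked against the Power ISA before relying on the values.

```c
#include <stdint.h>

/* Hypothetical helper: apply a three-input bitwise function chosen by an
   8-bit truth table to every bit position of a, b and c.
   ASSUMPTION: the entry for inputs (a, b, c) is table bit
   7 - (a*4 + b*2 + c), i.e. the most significant bit covers a=b=c=0.  */
static uint64_t
eval3 (uint64_t a, uint64_t b, uint64_t c, unsigned table)
{
  uint64_t result = 0;
  for (int bit = 0; bit < 64; bit++)
    {
      unsigned index = (unsigned) ((((a >> bit) & 1) << 2)
                                   | (((b >> bit) & 1) << 1)
                                   | ((c >> bit) & 1));
      uint64_t out = (table >> (7 - index)) & 1;
      result |= out << bit;
    }
  return result;
}

/* Hypothetical helper: derive the truth table for FN by evaluating it on
   all eight single-bit input combinations (same assumed bit ordering).  */
static unsigned
make_table (uint64_t (*fn) (uint64_t, uint64_t, uint64_t))
{
  unsigned table = 0;
  for (unsigned index = 0; index < 8; index++)
    {
      uint64_t a = (index >> 2) & 1;
      uint64_t b = (index >> 1) & 1;
      uint64_t c = index & 1;
      if (fn (a, b, c) & 1)
        table |= 1u << (7 - index);
    }
  return table;
}

/* The two two-instruction sequences discussed in the ChangeLog entry.  */
static uint64_t
andc_xor (uint64_t a, uint64_t b, uint64_t c)
{
  return (a & ~b) ^ c;          /* andc feeding into xor */
}

static uint64_t
xor_xor (uint64_t a, uint64_t b, uint64_t c)
{
  return (a ^ b) ^ c;           /* xor feeding into xor */
}
```

Under the assumed ordering, make_table (xor_xor) is 0x69 and
make_table (andc_xor) is 0x59.  Calling eval3 with either table gives the same
result as the two-operation sequence, which is the sense in which a single
three-input instruction can replace a fused pair without tying the operands to
the Altivec half of the register file.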