https://gcc.gnu.org/g:509e34b08bd2cfa42caf0b8dcc79d23247516648

commit 509e34b08bd2cfa42caf0b8dcc79d23247516648
Author: Michael Meissner <meiss...@linux.ibm.com>
Date:   Thu Jan 2 18:12:33 2025 -0500

    Update ChangeLog.*

Diff:
---
 gcc/ChangeLog.sha | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)

diff --git a/gcc/ChangeLog.sha b/gcc/ChangeLog.sha
index 3ca691489d87..ece851e7aa1f 100644
--- a/gcc/ChangeLog.sha
+++ b/gcc/ChangeLog.sha
@@ -1,5 +1,151 @@
+==================== Branch work190-sha, patch #400 ====================
+
+PR target/117251: Add PowerPC XXEVAL support to speed up SHA3 calculations
+
+The multibuff.c benchmark attached to PR target/117251, which implements SHA3,
+is slower when compiled for Power10 with the current trunk and GCC 14 than with
+GCC 11 - GCC 13, due to excessive amounts of spilling.
+
+The main function for the multibuff.c file has 3,747 lines, all of which are
+using vector unsigned long long.  There are 696 vector rotates (all rotates are
+constant), 1,824 vector xor's and 600 vector andc's.
+
+In looking at it, the main thing that stands out is that the reason for either
+spilling or moving variables is the support in fusion.md (generated by
+genfusion.pl) that tries to fuse a vec_andc feeding into a vec_xor, and a
+vec_xor feeding into another vec_xor.
+
+On Power10, there is a special fusion mode that applies when a VANDC or VXOR
+instruction is adjacent to a VXOR instruction and the VANDC/VXOR feeds into
+the second VXOR instruction.
+
+While the Power10 has 64 vector registers (which use the XXL prefix to do
+logical operations), the fusion only works with the older Altivec instruction
+set (which uses the V prefix).  The Altivec instruction set only has 32 vector
+registers (which are overlaid over the VSX vector registers 32-63).
+
+Because the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor do this
+fusion, the register allocator has more register pressure for the traditional
+Altivec registers instead of the VSX registers.
+
+In addition, the vector rotates only work on the traditional Altivec
+registers, which adds to the Altivec register pressure.
+
+Finally, in addition to doing the explicit xor, andc, and rotates using the
+Altivec registers, we also have to load vector constants for the rotate
+amounts, and these constants are also allocated to Altivec registers.
+
+Current trunk and GCC 12-14 have more vector spills than GCC 11, but GCC 11 has
+many more vector moves than the later compilers.  Thus even though it has far
+fewer spills, the vector moves are why GCC 11 has the slowest results.
+
+There is an instruction that was added in Power10 (XXEVAL) that does provide
+fusion between VSX vectors, including ANDC->XOR and XOR->XOR fusion.
+
+The latency of XXEVAL is slightly more than the fused VANDC/VXOR or VXOR/VXOR,
+so I have written the patch to prefer doing the Altivec instructions if they
+don't need a temporary register.
+
+Here are the results for adding support for XXEVAL for the multibuff.c
+benchmark attached to the PR.  Note that with this patch we essentially recover
+the speed that was lost with GCC 14 and the current trunk:
+
+                              XXEVAL    Trunk   GCC14   GCC13   GCC12    GCC11
+                              ------    -----   -----   -----   -----    -----
+Benchmark time in seconds       5.53     6.15    6.26    5.57    5.61     9.56
+
+Fuse VANDC -> VXOR               209      600     600     600     600      600
+Fuse VXOR -> VXOR                  0      240     240     120     120      120
+XXEVAL to fuse ANDC -> XOR       391        0       0       0       0        0
+XXEVAL to fuse XOR -> XOR        240        0       0       0       0        0
+
+Spill vector to stack             78      364     364     172     184      110
+Load spilled vector from stack   431      962     962     713     723      166
+Vector moves                      10      100     100      70      72    3,055
+
+Vector rotate right              696      696     696     696     696      696
+XXLANDC or VANDC                 209      600     600     600     600      600
+XXLXOR or VXOR                   953    1,824   1,824   1,824   1,824    1,825
+XXEVAL                           631        0       0       0       0        0
+
+Load vector rotate constants      24       24      24      24      24       24
+
+
+Here are the results for adding support for XXEVAL for the singlebuff.c
+benchmark attached to the PR.  Note that adding XXEVAL greatly speeds up this
+particular benchmark:
+
+                              XXEVAL    Trunk   GCC14   GCC13   GCC12    GCC11
+                              ------    -----   -----   -----   -----    -----
+Benchmark time in seconds       4.46     5.40    5.40    5.35    5.36     7.54
+
+Fuse VANDC -> VXOR               210      600     600     600     600      600
+Fuse VXOR -> VXOR                  0      240     240     120     120      120
+XXEVAL to fuse ANDC -> XOR       390        0       0       0       0        0
+XXEVAL to fuse XOR -> XOR        240        0       0       0       0        0
+
+Spill vector to stack            113      379     379     382     382       63
+Load spilled vector from stack   333      796     796     757     757       68
+Vector moves                      34       80      80     119     119    2,409
+
+Vector rotate right              696      696     696     696     696      696
+XXLANDC or VANDC                 210      600     600     600     600      600
+XXLXOR or VXOR                   954    1,824   1,824   1,824   1,824    1,824
+XXEVAL                           630        0       0       0       0        0
+
+Load vector rotate constants      96       96      96      96      96       96
+
+
+These patches add the following fusion patterns:
+
+       xxland  => xxland       xxlandc => xxland       xxlxor  => xxland
+       xxlor   => xxland       xxlnor  => xxland       xxleqv  => xxland
+       xxlorc  => xxland       xxlandc => xxlandc      xxlnand => xxland
+       xxlnand => xxlnor       xxland  => xxlxor       xxland  => xxlor
+       xxlandc => xxlxor       xxlandc => xxlor        xxlorc  => xxlnor
+       xxlorc  => xxleqv       xxlorc  => xxlorc       xxleqv  => xxlnor
+       xxlxor  => xxlxor       xxlxor  => xxlor        xxlnor  => xxlnor
+       xxlor   => xxlxor       xxlor   => xxlor        xxlor   => xxlnor
+       xxlnor  => xxlxor       xxlnor  => xxlor        xxlxor  => xxlnor
+       xxleqv  => xxlxor       xxleqv  => xxlor        xxlorc  => xxlxor
+       xxlorc  => xxlor        xxlandc => xxlnor       xxlandc => xxleqv
+       xxland  => xxlnor       xxlnand => xxlxor       xxlnand => xxlor
+       xxlnand => xxlnand      xxlorc  => xxlnand      xxleqv  => xxlnand
+       xxlnor  => xxlnand      xxlor   => xxlnand      xxlxor  => xxlnand
+       xxlandc => xxlnand      xxland  => xxlnand
+
+
+2025-01-02  Michael Meissner  <meiss...@linux.ibm.com>
+
+gcc/
+
+       PR target/117251
+       * config/rs6000/fusion.md: Regenerate.
+       * config/rs6000/genfusion.pl (gen_logical_addsubf): Add support to
+       generate vector/vector logical fusion if XXEVAL supports the fusion.
+       * config/rs6000/predicates.md (vector_fusion_operand): New predicate.
+       * config/rs6000/rs6000.cc (rs6000_opt_vars): Add -mxxeval.
+       * config/rs6000/rs6000.md (isa attribute): Add xxeval.
+       (enabled attribute): Add support for -mxxeval.
+       * config/rs6000/rs6000.opt (-mxxeval): New switch.
+
+gcc/testsuite/
+
+       PR target/117251
+       * gcc.target/powerpc/p10-vector-fused-1.c: New test.
+       * gcc.target/powerpc/p10-vector-fused-2.c: Likewise.
+
 ==================== Branch work190-sha, baseline ====================
 
+Add ChangeLog.sha and update REVISION.
+
+2025-01-02  Michael Meissner  <meiss...@linux.ibm.com>
+
+gcc/
+
+       * ChangeLog.sha: New file for branch.
+       * REVISION: Update.
+
 2025-01-02   Michael Meissner  <meiss...@linux.ibm.com>
 
        Clone branch

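For readers unfamiliar with the ANDC -> XOR and XOR -> XOR fusions described
in the ChangeLog entry above, the following C sketch models a single
three-input bitwise operation selected by an 8-bit truth table, which is the
general idea behind an instruction like XXEVAL.  It is not code from the patch:
the helper eval3, the TT_* constants, and in particular the bit ordering of the
table are assumptions made for illustration, and the Power ISA should be
consulted for the real encoding.

```c
#include <stdint.h>

/* Truth-table columns for the three inputs.  ASSUMPTION (illustration only):
   the most significant bit of the 8-bit table holds the result for
   a=0,b=0,c=0 and the least significant bit holds the result for a=1,b=1,c=1.
   Check the Power ISA before relying on this ordering.  */
enum { TT_A = 0x0f, TT_B = 0x33, TT_C = 0x55 };

/* Hypothetical helper: apply the 3-input bitwise function described by
   'table' to every bit position of a 64-bit lane.  */
static uint64_t
eval3 (uint64_t a, uint64_t b, uint64_t c, uint8_t table)
{
  uint64_t result = 0;

  for (int i = 0; i < 8; i++)
    {
      /* Skip input combinations for which the table says "0".  */
      if (!(table & (0x80u >> i)))
        continue;

      /* 'i' encodes one (a,b,c) combination, with 'a' as the high bit.
         Select the bit positions where the inputs match that combination.  */
      uint64_t ma = (i & 4) ? a : ~a;
      uint64_t mb = (i & 2) ? b : ~b;
      uint64_t mc = (i & 1) ? c : ~c;

      result |= ma & mb & mc;
    }

  return result;
}
```

Under these assumptions, eval3 (a, b, c, TT_A ^ TT_B ^ TT_C) computes
a ^ b ^ c in one step, and eval3 (a, b, c, (TT_A & ~TT_B) ^ TT_C) computes
(a & ~b) ^ c, so neither case needs a temporary for the intermediate value.
Which operand plays which role in the patterns generated by genfusion.pl is
not shown here.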