https://gcc.gnu.org/g:d9c908b75039653f2b7717b4b7cdffdc4f0fcc7d

commit r15-5650-gd9c908b75039653f2b7717b4b7cdffdc4f0fcc7d
Author: Richard Biener <rguent...@suse.de>
Date:   Fri Nov 22 13:58:08 2024 +0100

    Add extra 64bit SSE vector epilogue in some cases
    
    Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning, which enables
    an extra 128bit SSE vector epilogue when doing 512bit AVX512
    vectorization in the main loop, the following allows a 64bit SSE
    vector epilogue to be generated when the previous vector epilogue
    still had a vectorization factor of 16 or larger (which usually
    means we are operating on char data).
    
    This effectively applies to 256bit and 512bit AVX2/AVX512 main loops;
    a 128bit SSE main loop would already get a 64bit SSE vector epilogue.
    
    Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three
    vector epilogues for 512bit and two vector epilogues when enabling
    256bit vectorization.  I have not added another tunable for this
    RFC - suggestions on how to avoid inflation there welcome.
    
    This speeds up 525.x264_r to within 5% of the -mprefer-vector-width=128
    speed with -mprefer-vector-width=256 or -mprefer-vector-width=512
    (the latter only when -mtune-ctrl=avx512_two_epilogues is in effect).
    
    I have not done any further benchmarking; this merely shows the
    possibility and looks for guidance on how to expose this to the
    uarch tunings or to the user (if at all) when not gating on any
    uarch-specific tuning.
    
    Note 64bit SSE isn't a native vector size, so we rely on emulation
    being "complete" (if it is not, epilogue vectorization will simply
    fail, so it's "safe" in this regard).  With the AVX512 ISA available
    an alternative is a predicated epilogue, but due to possible STLF
    issues user control would be required here.
    
            * config/i386/i386.cc (ix86_vector_costs::finish_cost): For a
            128bit SSE epilogue request a 64bit SSE epilogue if the 128bit
            SSE epilogue VF was 16 or higher.

Diff:
---
 gcc/config/i386/i386.cc | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b4d11dff4ee5..8ab91202f63b 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25494,6 +25494,13 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
               && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
        m_suggested_epilogue_mode = V16QImode;
     }
+  /* When a 128bit SSE vectorized epilogue still has a VF of 16 or larger
+     enable a 64bit SSE epilogue.  */
+  if (loop_vinfo
+      && LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+      && GET_MODE_SIZE (loop_vinfo->vector_mode) == 16
+      && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
+    m_suggested_epilogue_mode = V8QImode;
 
   vector_costs::finish_cost (scalar_costs);
 }
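
To illustrate why a VF of 16 in a 128bit epilogue "usually means we are
operating on char data", here is a minimal standalone sketch of the new
condition.  It is not GCC code: vect_factor and wants_64bit_epilogue are
hypothetical helpers, and the sketch assumes the simple case where the
VF equals the vector size in bytes divided by the smallest element size
(no unrolling).

```cpp
// Hypothetical model of the check added to ix86_vector_costs::finish_cost.
// None of these names exist in GCC; they only mirror the patch's logic.

// Assumed VF for a non-unrolled loop: elements of the smallest type
// that fit into one vector.
constexpr int vect_factor (int vector_bytes, int element_bytes)
{
  return vector_bytes / element_bytes;
}

// True when the patch would suggest a further 64bit (V8QImode) epilogue:
// the loop being costed is itself an epilogue, it uses 16-byte (128bit)
// vectors, and its VF is still 16 or larger.
constexpr bool wants_64bit_epilogue (bool is_epilogue, int vector_bytes,
                                     int vf)
{
  return is_epilogue && vector_bytes == 16 && vf >= 16;
}

// char data: 16 / 1 = 16 lanes, so the extra epilogue is requested.
static_assert (wants_64bit_epilogue (true, 16, vect_factor (16, 1)),
               "char epilogue should request V8QImode");
// short data: 16 / 2 = 8 lanes, below the threshold.
static_assert (!wants_64bit_epilogue (true, 16, vect_factor (16, 2)),
               "short epilogue should not request V8QImode");
// A 128bit main loop is not an epilogue, so the new check does not fire.
static_assert (!wants_64bit_epilogue (false, 16, vect_factor (16, 1)),
               "main loop is not affected by this check");
```

Under these assumptions only byte-sized elements reach the threshold in
a 128bit epilogue; wider element types leave the VF at 8 or below.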
