https://gcc.gnu.org/g:d9c908b75039653f2b7717b4b7cdffdc4f0fcc7d
commit r15-5650-gd9c908b75039653f2b7717b4b7cdffdc4f0fcc7d
Author: Richard Biener <rguent...@suse.de>
Date:   Fri Nov 22 13:58:08 2024 +0100

    Add extra 64bit SSE vector epilogue in some cases

    Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning which enables
    an extra 128bit SSE vector epilogue when doing 512bit AVX512
    vectorization in the main loop the following allows a 64bit SSE
    vector epilogue to be generated when the previous vector epilogue
    still had a vectorization factor of 16 or larger (which usually
    means we are operating on char data).

    This effectively applies to 256bit and 512bit AVX2/AVX512 main loops,
    a 128bit SSE main loop would already get a 64bit SSE vector epilogue.

    Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three
    vector epilogues for 512bit and two vector epilogues when enabling
    256bit vectorization.  I have not added another tunable for this
    RFC - suggestions on how to avoid inflation there welcome.

    This speeds up 525.x264_r to within 5% of the -mprefer-vector-size=128
    speed with -mprefer-vector-size=256 or -mprefer-vector-size=512
    (the latter only when -mtune-ctrl=avx512_two_epilogues is in effect).

    I have not done any further benchmarking, this merely shows the
    possibility and looks for guidance on how to expose this to the
    uarch tunings or to the user (at all?) if not gating on any uarch
    specific tuning.

    Note 64bit SSE isn't a native vector size so we rely on emulation
    being "complete" (if not, epilogue vectorization will only fail, so
    it's "safe" in this regard).  With AVX512 ISA available an alternative
    is a predicated epilogue, but due to possible STLF issues user control
    would be required here.

            * config/i386/i386.cc (ix86_vector_costs::finish_cost): For a
            128bit SSE epilogue request a 64bit SSE epilogue if the 128bit
            SSE epilogue VF was 16 or higher.

Diff:
---
 gcc/config/i386/i386.cc | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b4d11dff4ee5..8ab91202f63b 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25494,6 +25494,13 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
           && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
         m_suggested_epilogue_mode = V16QImode;
     }
+  /* When a 128bit SSE vectorized epilogue still has a VF of 16 or larger
+     enable a 64bit SSE epilogue.  */
+  if (loop_vinfo
+      && LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+      && GET_MODE_SIZE (loop_vinfo->vector_mode) == 16
+      && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
+    m_suggested_epilogue_mode = V8QImode;
 
   vector_costs::finish_cost (scalar_costs);
 }
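
For illustration only, here is a small scalar model of the effect on char data. It is not part of the patch and not what GCC emits. It assumes a 256bit main loop (32 bytes per iteration), followed by a 128bit epilogue and optionally the new 64bit epilogue; `StageCounts` and `count_stages` are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: how many steps each stage executes when n char
// elements are processed by a 32-byte main loop, a 16-byte epilogue,
// an optional 8-byte epilogue, and finally a scalar loop.
struct StageCounts {
  std::size_t main32;  // iterations of the 256bit main loop
  std::size_t epi16;   // iterations of the 128bit epilogue (0 or 1)
  std::size_t epi8;    // iterations of the 64bit epilogue (0 or 1)
  std::size_t scalar;  // remaining scalar iterations
};

StageCounts count_stages(std::size_t n, bool with_64bit_epilogue) {
  const std::size_t total = n;
  StageCounts c{};
  c.main32 = n / 32;
  n %= 32;
  c.epi16 = n / 16;
  n %= 16;
  if (with_64bit_epilogue) {
    c.epi8 = n / 8;
    n %= 8;
  }
  c.scalar = n;
  // Every element is handled by exactly one stage.
  assert(c.main32 * 32 + c.epi16 * 16 + c.epi8 * 8 + c.scalar == total);
  return c;
}
```

In this model, 47 elements leave 15 scalar iterations after the 128bit epilogue is skipped, while the extra 64bit epilogue reduces that to 7. The scalar tail is bounded by 15 without the 64bit stage and by 7 with it.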