https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #11)
> trunk -O3 -flto -march=native -fopenmp
> Operation: Sharpen:
> 257
> 256
> 256
>
> Average: 256 Iterations Per Minute
> GCC13 -O3 -flto -march=native -fopenmp
> 257
> 256
> 256
>
> Average: 256 Iterations Per Minute
> clang17 O3 -flto -march=native -fopenmp
> Operation: Sharpen:
> 257
> 256
> 256
> Average: 256 Iterations Per Minute
>
> So I guess I will need to try on zen3 to see if there is any difference.
>
> the internal loop is:
>   0.00 │460:┌─→movzbl       0x2(%rdx,%rax,4),%esi
>   0.02 │    │  vmovss       (%r8,%rax,4),%xmm2
>   0.95 │    │  vcvtsi2ss    %esi,%xmm0,%xmm1
>  20.22 │    │  movzbl       0x1(%rdx,%rax,4),%esi
>   0.01 │    │  vfmadd231ss  %xmm1,%xmm2,%xmm3
>  11.97 │    │  vcvtsi2ss    %esi,%xmm0,%xmm1
>  18.76 │    │  movzbl       (%rdx,%rax,4),%esi
>   0.00 │    │  inc          %rax
>   0.72 │    │  vfmadd231ss  %xmm1,%xmm2,%xmm4
>  12.55 │    │  vcvtsi2ss    %esi,%xmm0,%xmm1
>  14.95 │    │  vfmadd231ss  %xmm1,%xmm2,%xmm5
>  15.93 │    ├──cmp          %rax,%r13
>   0.35 │    └──jne          460
>
>
> so it still does not get....

As said the VF is going to be prohibitively large, likely the vector code
is never entered and the above is the scalar "epilogue".
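
For readers following the listing, the scalar loop above corresponds roughly to the C below. This is only a sketch read back from the assembly, not the benchmark's actual source: the type `pixel4`, the function `accumulate_taps` and the parameter names are hypothetical. The 4-byte pixel stride, the per-iteration float weight and the three separate accumulators are taken from the quoted instructions (`movzbl` at offsets 0, 1 and 2, `vmovss` from a float array, three `vfmadd231ss`).

```c
#include <stddef.h>

/* Hypothetical pixel layout: four 8-bit channels, matching the
   4-byte stride of the movzbl loads in the listing. */
typedef struct {
    unsigned char c0, c1, c2, c3;
} pixel4;

/* Accumulate n weighted taps into three per-channel sums.
   Each iteration loads one float weight, converts three bytes to
   float (vcvtsi2ss) and performs three multiply-adds (vfmadd231ss).
   The fourth channel is not read, as in the listing. */
static void accumulate_taps(const pixel4 *p, const float *k, size_t n,
                            float acc[3])
{
    for (size_t i = 0; i < n; i++) {
        float w = k[i];
        acc[2] += w * (float)p[i].c2;  /* byte at offset 2 */
        acc[1] += w * (float)p[i].c1;  /* byte at offset 1 */
        acc[0] += w * (float)p[i].c0;  /* byte at offset 0 */
    }
}
```

If the hot loop really has this shape, one plausible reading of the comment is that vectorizing it means de-interleaving 8-bit channels and widening them to 32-bit floats, which pushes the vectorization factor up. With a trip count smaller than that VF, only the scalar copy would ever run. Building with `-fopt-info-vec` should show whether the loop was vectorized and with what vector size.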