https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #11)
> trunk -O3 -flto -march=native -fopenmp
>     Operation: Sharpen:
>         257
>         256
>         256
> 
>     Average: 256 Iterations Per Minute
> GCC13 -O3 -flto -march=native -fopenmp
>         257
>         256
>         256
> 
>     Average: 256 Iterations Per Minute
> clang17 -O3 -flto -march=native -fopenmp
>    Operation: Sharpen:
>         257
>         256
>         256
>     Average: 256 Iterations Per Minute
> 
> So I guess I will need to try on zen3 to see if there is any difference.
> 
> the internal loop is:
>   0.00 │460:┌─→movzbl      0x2(%rdx,%rax,4),%esi
>   0.02 │    │  vmovss      (%r8,%rax,4),%xmm2
>   0.95 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
>  20.22 │    │  movzbl      0x1(%rdx,%rax,4),%esi
>   0.01 │    │  vfmadd231ss %xmm1,%xmm2,%xmm3
>  11.97 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
>  18.76 │    │  movzbl      (%rdx,%rax,4),%esi
>   0.00 │    │  inc         %rax
>   0.72 │    │  vfmadd231ss %xmm1,%xmm2,%xmm4
>  12.55 │    │  vcvtsi2ss   %esi,%xmm0,%xmm1
>  14.95 │    │  vfmadd231ss %xmm1,%xmm2,%xmm5
>  15.93 │    ├──cmp         %rax,%r13
>   0.35 │    └──jne         460
> 
> 
> so it still does not get....

As said, the VF (vectorization factor) is going to be prohibitively large;
likely the vector code is never entered and the above is the scalar "epilogue".
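
To illustrate the point, here is a rough sketch in C, not taken from GCC or
from the benchmark source. split_iterations() is a hypothetical helper that
models a vectorized loop as "n / VF trips through the vector body, n % VF
trips through the scalar epilogue"; the real vectorizer also emits runtime
checks and applies a cost model, which this ignores. weighted_sum3() is my
reading of the quoted assembly (three float accumulators, one float weight
per element, bytes read at a stride of 4); the names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not GCC code: a simplified model of how a
   vectorized loop divides its trip count. */
typedef struct {
    size_t vector_iters;  /* trips through the vector body */
    size_t scalar_iters;  /* trips left for the scalar epilogue */
} loop_split;

static loop_split split_iterations(size_t n, size_t vf)
{
    assert(vf > 0);
    loop_split s = { n / vf, n % vf };
    return s;
}

/* Scalar loop of the same shape as the quoted profile: each element
   contributes one float weight and three bytes at offsets 0, 1 and 2
   of a 4-byte group. Function and parameter names are illustrative. */
static void weighted_sum3(const unsigned char *px, const float *k,
                          size_t n, float out[3])
{
    float c0 = 0.0f, c1 = 0.0f, c2 = 0.0f;
    for (size_t i = 0; i < n; i++) {
        c0 += k[i] * (float)px[4 * i + 0];
        c1 += k[i] * (float)px[4 * i + 1];
        c2 += k[i] * (float)px[4 * i + 2];
    }
    out[0] = c0;
    out[1] = c1;
    out[2] = c2;
}
```

With made-up numbers, split_iterations(5, 16) gives zero vector trips and
five scalar trips: whenever the trip count is below the VF, all the time is
spent in the epilogue, which would match a profile showing only scalar
vcvtsi2ss/vfmadd231ss instructions.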
