https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

            Bug ID: 110062
           Summary: missed vectorization in graphicsmagick
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Phoronix claims a 31% performance difference between GCC 13 and Clang on the
sharpen benchmark of GraphicsMagick.  On Zen 3 I reproduce only a 4%
difference, but the benchmark has only a single short inner loop:

  97.56%  gm               gm                          [.] ConvolveImage.◆
   0.88%  gm               libgomp.so.1.0.0            [.] 0x000000000002▒
   0.67%  gm               libc.so.6                   [.] __memmove_avx_▒

GCC version:
  2.38 │500:┌─→vmovss      (%r8,%rax,4),%xmm2                            ▒
  0.04 │    │  movzbl      0x2(%rdx,%rax,4),%ebp                         ▒
  0.09 │    │  vcvtsi2ss   %ebp,%xmm0,%xmm1                              ▒
  7.44 │    │  movzbl      0x1(%rdx,%rax,4),%ebp                         ▒
  0.16 │    │  vfmadd231ss %xmm1,%xmm2,%xmm7                             ▒
 30.23 │    │  vcvtsi2ss   %ebp,%xmm0,%xmm1                              ▒
  2.38 │    │  movzbl      (%rdx,%rax,4),%ebp                            ▒
  0.03 │    │  inc         %rax                                          ▒
  0.00 │    │  vfmadd231ss %xmm1,%xmm2,%xmm9                             ▒
 22.80 │    │  vcvtsi2ss   %ebp,%xmm0,%xmm1                              ▒
  1.03 │    │  vfmadd231ss %xmm1,%xmm2,%xmm10                            ▒
 30.49 │    ├──cmp         %rax,%rbx                                     ▒
  0.18 │    └──jne         500                                           ▒

Clang version:
  0.00 │1e70:┌─→movzbl       0x2(%rdx,%rsi,4),%r9d                       ▒
  0.05 │     │  vbroadcastss (%rcx,%rsi,4),%xmm3                         ▒
  0.56 │     │  movzwl       (%rdx,%rsi,4),%r11d                         ▒
  0.05 │     │  inc          %rsi                                        ▒
  0.00 │     │  vcvtsi2ss    %r9d,%xmm10,%xmm2                           ▒
  0.71 │     │  vfmadd231ss  %xmm2,%xmm3,%xmm0                           ▒
  1.17 │     │  vmovd        %r11d,%xmm2                                 ▒
  0.00 │     │  vpmovzxbd    %xmm2,%xmm2                                 ▒
  0.06 │     │  vcvtdq2ps    %xmm2,%xmm2                                 ▒
  0.89 │     │  vfmadd231ps  %xmm2,%xmm3,%xmm1                           ▒
  1.98 │     ├──cmp          %rsi,%r10                                   ▒
  0.00 │     └──jne          1e70                                        ▒
  0.00 │      ↑ jmp          1630                                        ▒

Probably the same issue as PR109812, but it reproduces on Zen CPUs and the
loop is even shorter.
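
For reference, a minimal sketch of the loop shape the two listings above
suggest.  This is reconstructed from the assembly only, not taken from the
GraphicsMagick source: the names pixel4, accum3 and convolve_row are
hypothetical, and the 4-byte pixel layout is assumed from the stride of 4 and
the byte offsets 0, 1 and 2 in the loads.  Each iteration loads one float
kernel weight and multiply-accumulates three unsigned char channels into
three float sums.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-byte pixel.  Only the first three bytes are read,
   matching offsets 0, 1 and 2 in the assembly listings. */
typedef struct { unsigned char c0, c1, c2, c3; } pixel4;

/* One float accumulator per channel. */
typedef struct { float c0, c1, c2; } accum3;

/* Accumulate kernel[i] * channel for three channels over n pixels.
   Written as plain scalar C; whether and how it gets vectorized is
   left entirely to the compiler. */
static accum3 convolve_row(const pixel4 *pixels, const float *kernel,
                           size_t n, accum3 acc)
{
    assert(pixels != NULL && kernel != NULL);
    for (size_t i = 0; i < n; i++) {
        float k = kernel[i];          /* one kernel weight per pixel */
        acc.c0 += k * pixels[i].c0;   /* byte to float, then multiply-add */
        acc.c1 += k * pixels[i].c1;
        acc.c2 += k * pixels[i].c2;
    }
    return acc;
}
```

In the GCC listing all three channels go through scalar vcvtsi2ss and
vfmadd231ss.  In the Clang listing the channels at offsets 0 and 1 are loaded
together (movzwl), widened with vpmovzxbd, converted with vcvtdq2ps and
accumulated with a single packed vfmadd231ps; only the channel at offset 2
stays scalar.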
