[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

hubicka at gcc dot gnu.org via Gcc-bugs Sun, 28 May 2023 10:29:21 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812


Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GraphicsMagick resize is a  |GraphicsMagick resize is a
                   |lot slower in GCC 13.1 vs   |lot slower in GCC 13.1 vs
                   |Clang 16                    |Clang 16 on Intel Raptor
                   |                            |Lake

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
On zen3 hardware I get, with GCC:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3                     
    Estimated Time To Completion: 4 Minutes [17:00 UTC] 
        Started Run 1 @ 16:57:17
        Started Run 2 @ 16:58:22
        Started Run 3 @ 16:59:26

    Operation: Resizing:
        1390
        1386
        1383

    Average: 1386 Iterations Per Minute
    Deviation: 0.25%

clang16:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [16:54 UTC]
        Started Run 1 @ 16:51:48
        Started Run 2 @ 16:52:52
        Started Run 3 @ 16:53:56

    Operation: Resizing:
        180
        180
        180

    Average: 180 Iterations Per Minute
    Deviation: 0.00%


GCC profile:
  52.07%  VerticalFilter._omp_fn.0                                              
  24.59%  HorizontalFilter._omp_fn.0                                            
  11.78%  ReadCachePixels.isra.0                                                

The Clang build does not seem to use OpenMP, so to get comparable runs I set
OMP_THREAD_LIMIT=1 for the GCC build.

With this I get:
GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [17:17 UTC]
        Started Run 1 @ 17:14:14
        Started Run 2 @ 17:15:18
        Started Run 3 @ 17:16:22

    Operation: Resizing:
        184
        186
        186

    Average: 185 Iterations Per Minute
    Deviation: 0.62%

so the GCC build is still a bit faster. The inner loop of VerticalFilter is:
  0.00 │4a0:┌─→mov          0x8(%rdx),%rax                                  ▒
  1.33 │    │  vmovsd       (%rdx),%xmm1                                    ▒
  1.58 │    │  add          $0x10,%rdx                                      ▒
  0.00 │    │  sub          %r13,%rax                                       ▒
  4.77 │    │  imul         %r11,%rax                                       ▒
  1.01 │    │  add          %rcx,%rax                                       ▒
  0.04 │    │  movzbl       0x2(%r15,%rax,4),%r10d                          ▒
  8.38 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0                               ▒
  2.44 │    │  movzbl       0x1(%r15,%rax,4),%r10d                          ◆
  1.55 │    │  movzbl       (%r15,%rax,4),%eax                              ▒
  0.00 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4                               ▒
 13.91 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0                               ▒
  1.86 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5                               ▒
 13.00 │    │  vcvtsi2sd    %eax,%xmm2,%xmm0                                ▒
  2.02 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3                               ▒
 12.54 │    ├──cmp          %rdx,%rdi                                       ▒
  0.00 │    └──jne          4a0                                             ▒

HorizontalFilter:
  0.01 │520:┌─→mov          0x8(%r8),%rdx                         ▒
  0.96 │    │  vmovsd       (%r8),%xmm1                           ▒
  1.93 │    │  add          $0x10,%r8                             ▒
  0.50 │    │  sub          %r15,%rdx                             ▒
  4.02 │    │  add          %r11,%rdx                             ▒
  2.26 │    │  movzbl       0x2(%r14,%rdx,4),%ebx                 ▒
  0.09 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0                      ▒
 10.10 │    │  movzbl       0x1(%r14,%rdx,4),%ebx                 ◆
  0.92 │    │  movzbl       (%r14,%rdx,4),%edx                    ▒
  1.84 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4                     ▒
  6.82 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0                      ▒
 11.15 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3                     ▒
 13.81 │    │  vcvtsi2sd    %edx,%xmm2,%xmm0                      ▒
  6.16 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5                     ▒
  8.61 │    ├──cmp          %rsi,%r8                              ▒
  1.56 │    └──jne          520                                   ▒
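
For readers not used to the disassembly, the two GCC loops above seem to
correspond to a scalar accumulation of roughly the following shape. This is
only a sketch reconstructed from the addressing modes (16-byte entries holding
a double and a 64-bit index, 4-byte pixels with one byte per channel); the
type, field and function names are hypothetical and are not taken from the
GraphicsMagick sources. The vertical loop multiplies the index by a stride
(the imul), the horizontal one does not, which is modelled here by passing
stride 1.

```c
#include <stddef.h>

/* Hypothetical layouts inferred from the disassembly, not from the real
   headers: a double at offset 0 and a 64-bit index at offset 8 (long is
   64-bit on x86-64 Linux), and 4-byte pixels with byte-sized channels. */
typedef struct { double weight; long pixel; } Contribution;
typedef struct { unsigned char c0, c1, c2, c3; } Pixel;

/* Scalar form: one byte load, one int->double conversion (vcvtsi2sd) and
   one scalar multiply-add (vfmadd231sd) per channel and per contribution. */
void accumulate_scalar(const Contribution *contrib, size_t n,
                       const Pixel *src, long start, long stride,
                       long x, double acc[3])
{
    for (size_t i = 0; i < n; i++) {
        /* sub / imul / add in the vertical loop; stride == 1 gives the
           sub / add sequence of the horizontal loop. */
        long j = (contrib[i].pixel - start) * stride + x;
        double w = contrib[i].weight;

        acc[0] += w * (double) src[j].c0;
        acc[1] += w * (double) src[j].c1;
        acc[2] += w * (double) src[j].c2;
    }
}
```

The three accumulators are independent, so the three conversions per
iteration are a plausible reason why the samples pile up around the
vcvtsi2sd instructions.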

ReadCachePixels:
       │2e0:┌─→mov    (%rbx,%rax,4),%edx                          ▒
 83.03 │    │  mov    %edx,(%r12,%rax,4)                          ▒
 12.34 │    │  inc    %rax                                        ▒
  0.02 │    ├──cmp    %rsi,%rax                                   ▒
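
The GCC loop above moves one 4-byte element per iteration. A minimal C sketch
of that shape, next to the memcpy call it is equivalent to for non-overlapping
buffers, could look like this. The function names are made up for
illustration and this is not the actual ReadCachePixels code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 32-bit load and one 32-bit store per iteration, as in the
   mov (%rbx,%rax,4) / mov ...,(%r12,%rax,4) pair in the dump. */
void copy_pixels_loop(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Same result when dst and src do not overlap. */
void copy_pixels_memcpy(uint32_t *dst, const uint32_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}
```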

With Clang I get:
  49.08% VerticalFilter                                                         
  24.66% HorizontalFilter                                                       
  18.41% ReadCachePixels                                                        
   6.75% SyncCacheViewPixels

  0.00 │1c50:┌─→mov          (%rdx,%rsi,1),%r9                    ▒
  0.09 │     │  vmovddup     -0x8(%rdx,%rsi,1),%xmm3              ▒
  0.00 │     │  add          $0x10,%rsi                           ▒
  0.75 │     │  sub          %rdi,%r9                             ▒
  0.00 │     │  imul         %rcx,%r9                             ▒
  1.07 │     │  add          %r11,%r9                             ▒
  0.81 │     │  movzbl       0x2(%r14,%r9,4),%r10d                ▒
  3.73 │     │  movzwl       (%r14,%r9,4),%r9d                    ▒
  0.00 │     │  vcvtsi2sd    %r10d,%xmm14,%xmm2                   ▒
  0.11 │     │  vfmadd231sd  %xmm2,%xmm3,%xmm1                    ▒
  2.57 │     │  vmovd        %r9d,%xmm2                           ▒
  0.00 │     │  vpmovzxbd    %xmm2,%xmm2                          ▒
  0.95 │     │  vcvtdq2pd    %xmm2,%xmm2                          ▒
  0.74 │     │  vfmadd231pd  %xmm2,%xmm3,%xmm0                    ▒
 11.46 │     ├──cmp          %rsi,%r8                             ▒

       │1b50:┌─→mov          (%r10,%rdi,1),%rcx                   ▒
  0.76 │     │  vmovddup     -0x8(%r10,%rdi,1),%xmm3              ▒
  0.00 │     │  add          $0x10,%rdi                           ▒
  0.05 │     │  sub          %r8,%rcx                             ▒
  0.30 │     │  add          %rsi,%rcx                            ▒
  0.27 │     │  movzbl       0x2(%r14,%rcx,4),%ebp                ▒
  0.28 │     │  movzwl       (%r14,%rcx,4),%ecx                   ▒
  4.51 │     │  vcvtsi2sd    %ebp,%xmm13,%xmm2                    ▒
  0.75 │     │  vfmadd231sd  %xmm2,%xmm3,%xmm1                    ▒
  0.99 │     │  vmovd        %ecx,%xmm2                           ▒
  0.00 │     │  vpmovzxbd    %xmm2,%xmm2                          ▒
  0.29 │     │  vcvtdq2pd    %xmm2,%xmm2                          ▒
  0.27 │     │  vfmadd231pd  %xmm2,%xmm3,%xmm0                    ▒
 12.37 │     ├──cmp          %rdi,%r9                             ▒
  0.16 │     └──jne          1b50                                 ▒
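
The two Clang loops above appear to handle two of the three channels as a
pair: one 16-bit load (movzwl), a byte-to-dword widening (vpmovzxbd), a
packed conversion (vcvtdq2pd) and a packed multiply-add (vfmadd231pd) with
the weight broadcast to both lanes (vmovddup), while the third channel stays
scalar. The sketch below mimics that shape in plain C with a two-element
array standing in for the vector register. All names and layouts are
hypothetical, inferred from the disassembly only.

```c
#include <stddef.h>

/* Hypothetical layouts, as inferred from the addressing in the dumps. */
typedef struct { double weight; long pixel; } Contribution;
typedef struct { unsigned char c0, c1, c2, c3; } Pixel;

/* Channels 0 and 1 travel together as a two-lane pair, channel 2 is
   accumulated separately, mirroring the packed/scalar split in the dump. */
void accumulate_paired(const Contribution *contrib, size_t n,
                       const Pixel *src, long start, long stride,
                       long x, double pair[2], double *third)
{
    for (size_t i = 0; i < n; i++) {
        long j = (contrib[i].pixel - start) * stride + x;
        double w = contrib[i].weight;                /* vmovddup */

        /* movzwl + vpmovzxbd + vcvtdq2pd: two bytes become two doubles */
        double lanes[2] = { src[j].c0, src[j].c1 };

        pair[0] += w * lanes[0];                     /* vfmadd231pd, lane 0 */
        pair[1] += w * lanes[1];                     /* vfmadd231pd, lane 1 */
        *third  += w * (double) src[j].c2;           /* vcvtsi2sd + vfmadd231sd */
    }
}
```

Compared with the fully scalar form this needs two conversions and two
multiply-adds per iteration instead of three of each, at the cost of the
extra vmovd/vpmovzxbd shuffling.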

  0.01 │        test    %r10,%r10                                 ▒
  0.01 │      ↓ jle     28b4                                      ▒
       │        lea     0x0(,%r15,4),%rcx                         ▒
  0.01 │        mov     0xd8(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r8,4),%rcx                         ▒
  0.01 │        lea     (%rcx,%rbp,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%rdi,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%rax,4),%rcx                        ▒
  0.02 │        lea     (%rcx,%rdx,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0xc8(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r9,4),%rcx                         ▒
  0.01 │        lea     (%rcx,%r13,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r11,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r12,4),%rcx                        ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0xb8(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0xb0(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.01 │        mov     0xa8(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x98(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%rsi,4),%rcx                        ▒
  0.03 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x88(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0xa0(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.01 │        mov     0x90(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x58(%rsp),%r10                           ▒
  0.02 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x50(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x48(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x40(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x38(%rsp),%r10                           ▒
  0.02 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x60(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x68(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        mov     0x70(%rsp),%r10                           ▒
  0.00 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.02 │        mov     0x78(%rsp),%r10                           ▒
  0.03 │        lea     (%rcx,%r10,4),%rcx                        ◆
  0.03 │        mov     0x80(%rsp),%r10                           ▒
  0.01 │        lea     (%rcx,%r10,4),%rcx                        ▒
  0.03 │        add     0x28(%rsp),%rcx                           ▒
  0.03 │        mov     %rcx,0xf0(%rsp)                           ▒
  0.00 │        xor     %ecx,%ecx                                 ▒
       │2584:   mov     0xf0(%rsp),%r10                           ▒
  0.01 │        mov     (%r10,%rcx,4),%r10d                       ▒
  3.58 │        inc     %rcx                                      ▒
  0.03 │        mov     %r10d,(%r14)                              ▒
  0.02 │        mov     0x30(%rsp),%r10                           ▒
  0.01 │        add     $0x4,%r14                                 ▒
  0.01 │        mov     (%r10),%r10                               ▒
  0.06 │        cmp     %r10,%rcx                                 ▒
  0.05 │      ↑ jl      2584                                      ▒

So I suppose Clang vectorizes the filter loops while its memcpy is unrolled (in
a very odd way).  I guess the vectorization does not help on zen3 but may help
on Raptor Lake.
