https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812
Jan Hubicka <hubicka at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GraphicsMagick resize is a  |GraphicsMagick resize is a
                   |lot slower in GCC 13.1 vs   |lot slower in GCC 13.1 vs
                   |Clang 16                    |Clang 16 on Intel Raptor
                   |                            |Lake
--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
On zen3 hardware I get the following with GCC:
GraphicsMagick 1.3.38:
pts/graphics-magick-2.1.0 [Operation: Resizing]
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 4 Minutes [17:00 UTC]
Started Run 1 @ 16:57:17
Started Run 2 @ 16:58:22
Started Run 3 @ 16:59:26
Operation: Resizing:
1390
1386
1383
Average: 1386 Iterations Per Minute
Deviation: 0.25%
clang16:
GraphicsMagick 1.3.38:
pts/graphics-magick-2.1.0 [Operation: Resizing]
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 4 Minutes [16:54 UTC]
Started Run 1 @ 16:51:48
Started Run 2 @ 16:52:52
Started Run 3 @ 16:53:56
Operation: Resizing:
180
180
180
Average: 180 Iterations Per Minute
Deviation: 0.00%
GCC profile:
52.07% VerticalFilter._omp_fn.0
24.59% HorizontalFilter._omp_fn.0
11.78% ReadCachePixels.isra.0
The Clang build does not seem to use OpenMP, so to get comparable runs I set
OMP_THREAD_LIMIT=1
for the GCC build. With this I get:
GraphicsMagick 1.3.38:
pts/graphics-magick-2.1.0 [Operation: Resizing]
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 4 Minutes [17:17 UTC]
Started Run 1 @ 17:14:14
Started Run 2 @ 17:15:18
Started Run 3 @ 17:16:22
Operation: Resizing:
184
186
186
Average: 185 Iterations Per Minute
Deviation: 0.62%
so the GCC build is still a bit faster. The inner loop of VerticalFilter is:
0.00 │4a0:┌─→mov 0x8(%rdx),%rax
1.33 │ │ vmovsd (%rdx),%xmm1
1.58 │ │ add $0x10,%rdx
0.00 │ │ sub %r13,%rax
4.77 │ │ imul %r11,%rax
1.01 │ │ add %rcx,%rax
0.04 │ │ movzbl 0x2(%r15,%rax,4),%r10d
8.38 │ │ vcvtsi2sd %r10d,%xmm2,%xmm0
2.44 │ │ movzbl 0x1(%r15,%rax,4),%r10d
1.55 │ │ movzbl (%r15,%rax,4),%eax
0.00 │ │ vfmadd231sd %xmm0,%xmm1,%xmm4
13.91 │ │ vcvtsi2sd %r10d,%xmm2,%xmm0
1.86 │ │ vfmadd231sd %xmm0,%xmm1,%xmm5
13.00 │ │ vcvtsi2sd %eax,%xmm2,%xmm0
2.02 │ │ vfmadd231sd %xmm0,%xmm1,%xmm3
12.54 │ ├──cmp %rdx,%rdi
0.00 │ └──jne 4a0
HorizontalFilter:
0.01 │520:┌─→mov 0x8(%r8),%rdx
0.96 │ │ vmovsd (%r8),%xmm1
1.93 │ │ add $0x10,%r8
0.50 │ │ sub %r15,%rdx
4.02 │ │ add %r11,%rdx
2.26 │ │ movzbl 0x2(%r14,%rdx,4),%ebx
0.09 │ │ vcvtsi2sd %ebx,%xmm2,%xmm0
10.10 │ │ movzbl 0x1(%r14,%rdx,4),%ebx
0.92 │ │ movzbl (%r14,%rdx,4),%edx
1.84 │ │ vfmadd231sd %xmm0,%xmm1,%xmm4
6.82 │ │ vcvtsi2sd %ebx,%xmm2,%xmm0
11.15 │ │ vfmadd231sd %xmm0,%xmm1,%xmm3
13.81 │ │ vcvtsi2sd %edx,%xmm2,%xmm0
6.16 │ │ vfmadd231sd %xmm0,%xmm1,%xmm5
8.61 │ ├──cmp %rsi,%r8
1.56 │ └──jne 520
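For readers who do not want to decode the two filter dumps: both loops walk an array of 16-byte records (a double weight at offset 0, an integer pixel position at offset 8), compute a pixel index, load three bytes from a 4-byte pixel, convert each to double and accumulate weight * channel with a scalar FMA. The C below is only a sketch of that shape reconstructed from the disassembly; the names `Contribution`, `Pixel4`, `filter_accumulate`, `first`, `stride` and `offset` are hypothetical and are not taken from the GraphicsMagick sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record layout inferred from the loads in the dump. */
typedef struct {
    double weight;  /* vmovsd from offset 0 */
    long   pixel;   /* mov from offset 8 */
} Contribution;     /* the loops advance by 0x10 bytes per record */

/* Hypothetical 4-byte pixel; only the bytes at offsets 0..2 are read. */
typedef struct {
    unsigned char c0, c1, c2, c3;
} Pixel4;

/* Accumulate three weighted channel sums over n contributions.
 * index = (pixel - first) * stride + offset, matching sub/imul/add in the
 * VerticalFilter loop; the HorizontalFilter loop has no imul, i.e. stride 1. */
static void filter_accumulate(const Contribution *contrib, size_t n,
                              const Pixel4 *src,
                              long first, long stride, long offset,
                              double sum[3])
{
    assert(contrib != NULL || n == 0);
    sum[0] = sum[1] = sum[2] = 0.0;
    for (size_t i = 0; i < n; i++) {
        long j = (contrib[i].pixel - first) * stride + offset;
        double w = contrib[i].weight;
        /* One integer-to-double conversion and one FMA per channel. */
        sum[0] += w * src[j].c0;
        sum[1] += w * src[j].c1;
        sum[2] += w * src[j].c2;
    }
}
```

Written this way, each iteration needs three separate byte-to-double conversions, which is where most of the samples in the dumps above land.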
ReadCachePixels:
│2e0:┌─→mov (%rbx,%rax,4),%edx
83.03 │ │ mov %edx,(%r12,%rax,4)
12.34 │ │ inc %rax
0.02 │ ├──cmp %rsi,%rax
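The ReadCachePixels loop above moves one 32-bit element per iteration from one buffer to another. As a rough illustration only, this is the kind of source loop that produces such code, next to the memcpy call it is equivalent to for non-overlapping buffers. Treating a pixel as a `uint32_t` is an assumption based on the 4-byte moves, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Element-wise copy: one 4-byte load and one 4-byte store per iteration,
 * the same shape as the loop in the dump. */
static void copy_pixels_loop(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Equivalent bulk copy, valid only when the two buffers do not overlap. */
static void copy_pixels_memcpy(uint32_t *dst, const uint32_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}
```

Whether a compiler keeps the element loop, unrolls it, or replaces it with a library call depends on what it can prove about aliasing and on the optimization level, so the two forms need not compile to the same code.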
With Clang I get:
49.08% VerticalFilter
24.66% HorizontalFilter
18.41% ReadCachePixels
6.75% SyncCacheViewPixels
0.00 │1c50:┌─→mov (%rdx,%rsi,1),%r9
0.09 │ │ vmovddup -0x8(%rdx,%rsi,1),%xmm3
0.00 │ │ add $0x10,%rsi
0.75 │ │ sub %rdi,%r9
0.00 │ │ imul %rcx,%r9
1.07 │ │ add %r11,%r9
0.81 │ │ movzbl 0x2(%r14,%r9,4),%r10d
3.73 │ │ movzwl (%r14,%r9,4),%r9d
0.00 │ │ vcvtsi2sd %r10d,%xmm14,%xmm2
0.11 │ │ vfmadd231sd %xmm2,%xmm3,%xmm1
2.57 │ │ vmovd %r9d,%xmm2
0.00 │ │ vpmovzxbd %xmm2,%xmm2
0.95 │ │ vcvtdq2pd %xmm2,%xmm2
0.74 │ │ vfmadd231pd %xmm2,%xmm3,%xmm0
11.46 │ ├──cmp %rsi,%r8
│1b50:┌─→mov (%r10,%rdi,1),%rcx
0.76 │ │ vmovddup -0x8(%r10,%rdi,1),%xmm3
0.00 │ │ add $0x10,%rdi
0.05 │ │ sub %r8,%rcx
0.30 │ │ add %rsi,%rcx
0.27 │ │ movzbl 0x2(%r14,%rcx,4),%ebp
0.28 │ │ movzwl (%r14,%rcx,4),%ecx
4.51 │ │ vcvtsi2sd %ebp,%xmm13,%xmm2
0.75 │ │ vfmadd231sd %xmm2,%xmm3,%xmm1
0.99 │ │ vmovd %ecx,%xmm2
0.00 │ │ vpmovzxbd %xmm2,%xmm2
0.29 │ │ vcvtdq2pd %xmm2,%xmm2
0.27 │ │ vfmadd231pd %xmm2,%xmm3,%xmm0
12.37 │ ├──cmp %rdi,%r9
0.16 │ └──jne 1b50
0.01 │ test %r10,%r10
0.01 │ ↓ jle 28b4
│ lea 0x0(,%r15,4),%rcx
0.01 │ mov 0xd8(%rsp),%r10
0.00 │ lea (%rcx,%r8,4),%rcx
0.01 │ lea (%rcx,%rbp,4),%rcx
0.01 │ lea (%rcx,%rdi,4),%rcx
0.01 │ lea (%rcx,%rax,4),%rcx
0.02 │ lea (%rcx,%rdx,4),%rcx
0.01 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0xc8(%rsp),%r10
0.01 │ lea (%rcx,%r9,4),%rcx
0.01 │ lea (%rcx,%r13,4),%rcx
0.01 │ lea (%rcx,%r11,4),%rcx
0.01 │ lea (%rcx,%r12,4),%rcx
0.01 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0xb8(%rsp),%r10
0.01 │ lea (%rcx,%r10,4),%rcx
0.03 │ mov 0xb0(%rsp),%r10
0.01 │ lea (%rcx,%r10,4),%rcx
0.01 │ mov 0xa8(%rsp),%r10
0.00 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0x98(%rsp),%r10
0.00 │ lea (%rcx,%rsi,4),%rcx
0.03 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0x88(%rsp),%r10
0.01 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0xa0(%rsp),%r10
0.00 │ lea (%rcx,%r10,4),%rcx
0.01 │ mov 0x90(%rsp),%r10
0.00 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0x58(%rsp),%r10
0.02 │ lea (%rcx,%r10,4),%rcx
0.03 │ mov 0x50(%rsp),%r10
0.00 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0x48(%rsp),%r10
0.01 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0x40(%rsp),%r10
0.00 │ lea (%rcx,%r10,4),%rcx
0.03 │ mov 0x38(%rsp),%r10
0.02 │ lea (%rcx,%r10,4),%rcx
0.03 │ mov 0x60(%rsp),%r10
0.01 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0x68(%rsp),%r10
0.00 │ lea (%rcx,%r10,4),%rcx
0.03 │ mov 0x70(%rsp),%r10
0.00 │ lea (%rcx,%r10,4),%rcx
0.02 │ mov 0x78(%rsp),%r10
0.03 │ lea (%rcx,%r10,4),%rcx
0.03 │ mov 0x80(%rsp),%r10
0.01 │ lea (%rcx,%r10,4),%rcx
0.03 │ add 0x28(%rsp),%rcx
0.03 │ mov %rcx,0xf0(%rsp)
0.00 │ xor %ecx,%ecx
0.00 │ xor %ecx,%ecx
│2584: mov 0xf0(%rsp),%r10
0.01 │ mov (%r10,%rcx,4),%r10d
3.58 │ inc %rcx
0.03 │ mov %r10d,(%r14)
0.02 │ mov 0x30(%rsp),%r10
0.01 │ add $0x4,%r14
0.01 │ mov (%r10),%r10
0.06 │ cmp %r10,%rcx
0.05 │ ↑ jl 2584
So I suppose the filter loops are vectorized while the memcpy is unrolled (in a
very odd way). I guess the vectorization does not help on zen3 but may help on
Raptor Lake.
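To make "vectorized" concrete: in the Clang filter loops the weight is broadcast to both lanes (vmovddup), the bytes at offsets 0 and 1 are fetched with one 16-bit load (movzwl), widened and converted together (vpmovzxbd, vcvtdq2pd) and accumulated with a single packed vfmadd231pd, while the byte at offset 2 still goes through the scalar vcvtsi2sd / vfmadd231sd path. The plain C below is only a sketch of that two-plus-one split; `Contribution`, `filter_accumulate_paired` and the parameter names are hypothetical, and whether a given compiler emits packed instructions for it depends on its version and flags.

```c
#include <stddef.h>

/* Hypothetical 16-byte record: weight at offset 0, pixel position at 8. */
typedef struct {
    double weight;
    long   pixel;
} Contribution;

/* Same arithmetic as a three-channel scalar accumulation, but with the
 * channels at byte offsets 0 and 1 kept together as a two-lane pair and
 * the channel at offset 2 handled on its own. src holds 4 bytes per pixel. */
static void filter_accumulate_paired(const Contribution *contrib, size_t n,
                                     const unsigned char *src,
                                     long first, long stride, long offset,
                                     double sum[3])
{
    double pair[2] = {0.0, 0.0};  /* two lanes, updated together */
    double third = 0.0;           /* remaining channel, scalar */
    for (size_t i = 0; i < n; i++) {
        long j = (contrib[i].pixel - first) * stride + offset;
        const unsigned char *p = src + 4 * j;
        double w = contrib[i].weight;  /* same weight for both lanes */
        third += w * p[2];
        for (int k = 0; k < 2; k++)
            pair[k] += w * p[k];
    }
    sum[0] = pair[0];
    sum[1] = pair[1];
    sum[2] = third;
}
```

The split saves one conversion and one FMA per iteration at the cost of extra shuffles (vmovd, vpmovzxbd), which fits the guess that the benefit depends on the microarchitecture.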