https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117717
Bug ID: 117717
Summary: Auto-vectorization resulting in poor performance
Product: gcc
Version: 14.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: iansseijelly at berkeley dot edu
Target Milestone: ---

See: https://github.com/iansseijelly/what-is-wrong/tree/main for scripts and code to reproduce the issue.

When activating autovectorization (-O3 for gcc11 and -O2 & -O3 for gcc12+), it would produce the following optimization message:

```
sort.c:31:26: optimized: basic block part vectorized using 8 byte vectors
```

Leading to the creation of the following assembly:

```
void bubble_sort (uint32_t *a, uint32_t n) {
    11d0:  48 89 f8                mov    %rdi,%rax
    s = 0;
    11d3:  45 31 c0                xor    %r8d,%r8d
    11d6:  66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    11dd:  00 00 00 00
    11e1:  66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    11e8:  00 00 00 00
    11ec:  66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    11f3:  00 00 00 00
    11f7:  66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
    11fe:  00 00
      if (a[i] < a[i - 1]) {
    1200:  f3 0f 7e 00             movq   (%rax),%xmm0
    1204:  66 0f 70 c8 e5          pshufd $0xe5,%xmm0,%xmm1
    1209:  66 0f 7e c2             movd   %xmm0,%edx
    120d:  66 0f 7e c9             movd   %xmm1,%ecx
    1211:  39 d1                   cmp    %edx,%ecx
    1213:  73 0f                   jae    1224 <bubble_sort+0x64>
        a[i - 1] = t;
    1215:  66 0f 70 c0 e1          pshufd $0xe1,%xmm0,%xmm0
        s = 1;
    121a:  41 b8 01 00 00 00       mov    $0x1,%r8d
        a[i - 1] = t;
    1220:  66 0f d6 00             movq   %xmm0,(%rax)
    for (i = 1; i < n; i++) {
    1224:  48 83 c0 04             add    $0x4,%rax
    1228:  48 39 f0                cmp    %rsi,%rax
    122b:  75 d3                   jne    1200 <bubble_sort+0x40>
  while (s) {
    122d:  45 85 c0                test   %r8d,%r8d
    1230:  75 9e                   jne    11d0 <bubble_sort+0x10>
}
    1232:  c3                      ret
    1233:  66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    123a:  00 00 00 00
    123e:  66 90                   xchg   %ax,%ax
```

For the bubble sort code:

```
void bubble_sort (uint32_t *a, uint32_t n) {
  uint32_t i, t, s = 1;
  while (s) {
    s = 0;
    for (i = 1; i < n; i++) {
      if (a[i] < a[i - 1]) {
        t = a[i];
        a[i] = a[i - 1];
        a[i - 1] = t;
        s = 1;
      }
    }
  }
}
```

This inefficient use of fixed-width SIMD registers makes the code ~2x slower than -O0 and ~5.5x slower than -O3 with vectorization explicitly disabled. However, disabling vectorization also means that the code that populates the arrays is no longer vectorized, which costs performance there.

This pathological use of SIMD (%xmm) registers for sorting is also observable on other x86_64 machines (AMD and Xeon). It is not observable with clang+x86, clang+arm, gcc+arm, or gcc+riscv.

Additionally, when using FDO or AutoFDO with gcc11, GCC also seems to vectorize the sorting basic block, even at -O2 (where %xmm registers should not be used at all).
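
For anyone trying to narrow this down, the trade-off described above (turning vectorization off fixes the sort but also de-vectorizes the array population) can be explored per function instead of per translation unit. The following is only a sketch and not part of the linked reproduction scripts: it assumes GCC's function-level `optimize` attribute with the string `"no-tree-vectorize"`, and the names `NO_VECTORIZE`, `bubble_sort_novec`, `fill_descending` and `is_sorted` are made up for illustration. Whether the attribute actually suppresses the 8-byte basic-block vectorization on a given GCC version would still need to be confirmed from the optimization messages (for example with `-fopt-info-vec`).

```c
#include <stdint.h>

/* Assumption: GCC's per-function `optimize` attribute is used to switch the
   vectorizer off for one function only. On other compilers (including
   clang) the macro expands to nothing. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
#else
#define NO_VECTORIZE
#endif

/* Same algorithm as bubble_sort in the report, only renamed and annotated. */
NO_VECTORIZE
void bubble_sort_novec(uint32_t *a, uint32_t n)
{
    uint32_t i, t, s = 1;
    while (s) {
        s = 0;
        for (i = 1; i < n; i++) {
            if (a[i] < a[i - 1]) {
                t = a[i];
                a[i] = a[i - 1];
                a[i - 1] = t;
                s = 1;
            }
        }
    }
}

/* Hypothetical array-population helper. It carries no attribute, so the
   compiler remains free to vectorize it at -O2/-O3. */
void fill_descending(uint32_t *a, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        a[i] = n - i;
}

/* Returns 1 if a[0..n-1] is in non-decreasing order, 0 otherwise. */
int is_sorted(const uint32_t *a, uint32_t n)
{
    for (uint32_t i = 1; i < n; i++)
        if (a[i] < a[i - 1])
            return 0;
    return 1;
}
```

A caller would fill a buffer with `fill_descending`, time `bubble_sort_novec` against the unannotated `bubble_sort` built with the same flags, and use `is_sorted` as a sanity check on the result.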