https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117717

            Bug ID: 117717
           Summary: Auto-vectorization resulting in poor performance
           Product: gcc
           Version: 14.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: iansseijelly at berkeley dot edu
  Target Milestone: ---

See: https://github.com/iansseijelly/what-is-wrong/tree/main for scripts and
code to reproduce the issue. 

When activating autovectorization (-O3 for gcc11 and -O2 & -O3 for gcc12+), it
would produce the following optimization message:
```
sort.c:31:26: optimized: basic block part vectorized using 8 byte vectors
```
Leading to the creation of the following assembly:
```
void bubble_sort (uint32_t *a, uint32_t n) {
    11d0:       48 89 f8                mov    %rdi,%rax
        s = 0;
    11d3:       45 31 c0                xor    %r8d,%r8d
    11d6:       66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    11dd:       00 00 00 00 
    11e1:       66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    11e8:       00 00 00 00 
    11ec:       66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    11f3:       00 00 00 00 
    11f7:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
    11fe:       00 00 
            if (a[i] < a[i - 1]) {
    1200:       f3 0f 7e 00             movq   (%rax),%xmm0
    1204:       66 0f 70 c8 e5          pshufd $0xe5,%xmm0,%xmm1
    1209:       66 0f 7e c2             movd   %xmm0,%edx
    120d:       66 0f 7e c9             movd   %xmm1,%ecx
    1211:       39 d1                   cmp    %edx,%ecx
    1213:       73 0f                   jae    1224 <bubble_sort+0x64>
                a[i - 1] = t;
    1215:       66 0f 70 c0 e1          pshufd $0xe1,%xmm0,%xmm0
                s = 1;
    121a:       41 b8 01 00 00 00       mov    $0x1,%r8d
                a[i - 1] = t;
    1220:       66 0f d6 00             movq   %xmm0,(%rax)
        for (i = 1; i < n; i++) {
    1224:       48 83 c0 04             add    $0x4,%rax
    1228:       48 39 f0                cmp    %rsi,%rax
    122b:       75 d3                   jne    1200 <bubble_sort+0x40>
    while (s) {
    122d:       45 85 c0                test   %r8d,%r8d
    1230:       75 9e                   jne    11d0 <bubble_sort+0x10>
}
    1232:       c3                      ret    
    1233:       66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    123a:       00 00 00 00 
    123e:       66 90                   xchg   %ax,%ax
```
For the bubble sort code:
```
void bubble_sort (uint32_t *a, uint32_t n) {
    uint32_t i, t, s = 1;
    while (s) {
        s = 0;
        for (i = 1; i < n; i++) {
            if (a[i] < a[i - 1]) {
                t = a[i];
                a[i] = a[i - 1];
                a[i - 1] = t;
                s = 1;
            }
        }
    }
}
```
This inefficient use of fixed-width simd registers made the code ~2x worse than
O0 and ~5.5x worse than O3 with vectorization explicitly disabled. 
However, this also means the part for populating arrays are not vectorized,
leading to performance drops. 
This phenomenum of pathologically using avx for sorting is observable on other
x86_64 machines like AMD and Xeon, and is not observable on clang+x86,
clang+arm, gcc+arm, or gcc+riscv. 

Additionally, when use fdo or autofdo in gcc11, it seems that gcc also tries to
vectorize the sorting basic block, despite under O2 (where %xmm should not be
used at all).

Reply via email to