https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037
Bug ID: 94037 Summary: Runtime varies 2x just by order of two int assignments Product: gcc Version: 9.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ncm at cantrip dot org Target Milestone: --- (This report re-uses some code from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 but identifies an entirely different problem.) Given: ``` bool swap_if(bool c, int& a, int& b) { int v[2] = { a, b }; #ifdef FAST /* 4.6s */ b = v[1-c], a = v[c]; #else /* SLOW, 9.8s */ a = v[c], b = v[!c]; #endif return c; } int* partition(int* begin, int* end) { int pivot = end[-1]; int* left = begin; for (int* right = begin; right < end - 1; ++right) { left += swap_if(*right <= pivot, *left, *right); } int tmp = *left; *left = end[-1], end[-1] = tmp; return left; } void quicksort(int* begin, int* end) { while (end - begin > 1) { int* mid = partition(begin, end); quicksort(begin, mid); begin = mid + 1; } } static const int size = 100'000'000; #include <sys/mman.h> #include <unistd.h> #include <fcntl.h> int main(int, char**) { int fd = ::open("1e8ints", O_RDONLY); int perms = PROT_READ|PROT_WRITE; int flags = MAP_PRIVATE|MAP_POPULATE|MAP_NORESERVE; auto* a = (int*) ::mmap(nullptr, size * sizeof(int), perms, flags, fd, 0); quicksort(a, a + size); return a[0] == a[size - 1]; } ``` after ``` $ dd if=/dev/urandom of=1e8ints bs=1000000 count=400 ``` The run time of the the program above, built "-O3 -march=skylake" vs. "-DFAST -O3 -march=skylake", varies by 2x on Skylake, similarly on Haswell. Both cases are almost equally fast on Clang, matching G++'s fast version. The difference between "!c" and "1-c" in the array index exacerbates the disparity. Godbolt `<https://godbolt.org/z/w-buUF>` says, slow: ``` movl (%rax), %edx movl (%rbx), %esi movl %esi, 8(%rsp) movl %edx, 12(%rsp) cmpl %edx, %ecx setge %sil movzbl %sil, %esi movl 8(%rsp,%rsi,4), %esi movl %esi, (%rbx) setl %sil movzbl %sil, %esi movl 8(%rsp,%rsi,4), %esi movl %esi, (%rax) ``` and 2x as fast: ``` movl (%rax), %ecx cmpl %ecx, %r8d setge %dl movzbl %dl, %edx movl (%rbx), %esi movl %esi, 8(%rsp) movl %ecx, 12(%rsp) movl %r9d, %esi subl %edx, %esi movslq %esi, %rsi movl 8(%rsp,%rsi,4), %esi movl %esi, (%rax) movslq %edx, %rdx movl 8(%rsp,%rdx,4), %edx movl %edx, (%rbx) cmpl %ecx, %r8d ``` Clang 9.0.0, -DFAST, for reference: ``` movl (%rcx), %r11d xorl %edx, %edx xorl %esi, %esi cmpl %r8d, %r11d setle %dl setg %sil movl (%rbx), %eax movl %eax, (%rsp) movl %r11d, 4(%rsp) movl (%rsp,%rsi,4), %eax movl %eax, (%rcx) movl (%rsp,%rdx,4), %eax movl %eax, (%rbx) ```