[Bug rtl-optimization/94037] New: Runtime varies 2x just by order of two int assignments

ncm at cantrip dot org Wed, 04 Mar 2020 12:27:23 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037


            Bug ID: 94037
           Summary: Runtime varies 2x just by order of two int assignments
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ncm at cantrip dot org
  Target Milestone: ---

(This report re-uses some code from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 but identifies an entirely
different problem.)

Given:

```
bool swap_if(bool c, int& a, int& b) {
  int v[2] = { a, b };
#ifdef FAST  /* 4.6s */
  b = v[1-c], a = v[c];
#else /* SLOW, 9.8s */
  a = v[c], b = v[!c];
#endif
  return c;
}

int* partition(int* begin, int* end) {
  int pivot = end[-1];
  int* left = begin;
  for (int* right = begin; right < end - 1; ++right) {
    left += swap_if(*right <= pivot, *left, *right);
  }
  int tmp = *left; *left = end[-1], end[-1] = tmp;
  return left;
}

void quicksort(int* begin, int* end) {
  while (end - begin > 1) {
    int* mid = partition(begin, end);
    quicksort(begin, mid);
    begin = mid + 1;
} }

static const int size = 100'000'000;

#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

int main(int, char**) {
  int fd = ::open("1e8ints", O_RDONLY);
  int perms = PROT_READ|PROT_WRITE;
  int flags = MAP_PRIVATE|MAP_POPULATE|MAP_NORESERVE;
  auto* a = (int*) ::mmap(nullptr, size * sizeof(int), perms, flags, fd, 0);

  quicksort(a, a + size);

  return a[0] == a[size - 1];
}
```
after
```
  $ dd if=/dev/urandom of=1e8ints bs=1000000 count=400
```

The run time of the the program above, built "-O3 -march=skylake"
vs. "-DFAST -O3 -march=skylake", varies by 2x on Skylake, similarly
on Haswell. Both cases are almost equally fast on Clang, matching 
G++'s fast version. The difference between "!c" and "1-c" in the 
array index exacerbates the disparity.

Godbolt `<https://godbolt.org/z/w-buUF>` says, slow:
```
        movl    (%rax), %edx
        movl    (%rbx), %esi
        movl    %esi, 8(%rsp)
        movl    %edx, 12(%rsp)
        cmpl    %edx, %ecx
        setge   %sil
        movzbl  %sil, %esi
        movl    8(%rsp,%rsi,4), %esi
        movl    %esi, (%rbx)
        setl    %sil
        movzbl  %sil, %esi
        movl    8(%rsp,%rsi,4), %esi
        movl    %esi, (%rax)
```
and 2x as fast:
```
        movl    (%rax), %ecx
        cmpl    %ecx, %r8d
        setge   %dl
        movzbl  %dl, %edx
        movl    (%rbx), %esi
        movl    %esi, 8(%rsp)
        movl    %ecx, 12(%rsp)
        movl    %r9d, %esi
        subl    %edx, %esi
        movslq  %esi, %rsi
        movl    8(%rsp,%rsi,4), %esi
        movl    %esi, (%rax)
        movslq  %edx, %rdx
        movl    8(%rsp,%rdx,4), %edx
        movl    %edx, (%rbx)
        cmpl    %ecx, %r8d
```
Clang 9.0.0, -DFAST, for reference:
```
        movl    (%rcx), %r11d
        xorl    %edx, %edx
        xorl    %esi, %esi
        cmpl    %r8d, %r11d
        setle   %dl
        setg    %sil
        movl    (%rbx), %eax
        movl    %eax, (%rsp)
        movl    %r11d, 4(%rsp)
        movl    (%rsp,%rsi,4), %eax
        movl    %eax, (%rcx)
        movl    (%rsp,%rdx,4), %eax
        movl    %eax, (%rbx)
```

[Bug rtl-optimization/94037] New: Runtime varies 2x just by order of two int assignments

Reply via email to