https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

            Bug ID: 85740
           Summary: Non-optimal determining maximum in a loop
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

There is a problem with loops determining the maximum (or minimum)
of a value. I noticed this when looking at inline code generated
by gfortran for maxloc/minloc.

Consider the two programs which select the position of a
maximum from a rather large integer array. They are identical,
except that b.c uses __builtin_expect in the inner loop:

$ cat b.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <x86intrin.h>

#define N (1<<28)

int foo(int *a, int n)
{
  int m, nm;
  int i;

  m = -20000;
  nm = -1;
  for (i=0; i<n; i++)
    {
      if (__builtin_expect (a[i] > m, 0))
        {
          m = a[i];
          nm = i;
        }
    }
  return nm;
}

int 
main(int argc, char **argv) {
  unsigned long long t1, t2;

  int *p = malloc(N * sizeof(int));
  for (int i=0; i<N; i++)
    p[i] = random();
  t1 = __rdtsc();
  int res = foo(p, N);
  t2 = __rdtsc();
  printf("%d\n", res);
  printf("time used: %llu cycles\n", t2-t1);
  return 0;
}
$ cat c.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <x86intrin.h>

#define N (1<<28)

int foo(int *a, int n)
{
  int m, nm;
  int i;

  m = -20000;
  nm = -1;
  for (i=0; i<n; i++)
    {
      if (a[i] > m)
        {
          m = a[i];
          nm = i;
        }
    }
  return nm;
}

int 
main(int argc, char **argv) {
  unsigned long long t1, t2;

  int *p = malloc(N * sizeof(int));
  for (int i=0; i<N; i++)
    p[i] = random();
  t1 = __rdtsc();
  int res = foo(p, N);
  t2 = __rdtsc();
  printf("%d\n", res);
  printf("time used: %llu cycles\n", t2-t1);
  return 0;
}

$ gcc -Ofast b.c && ./a.out
13068230
time used: 281813216 cycles
$ gcc -Ofast c.c && ./a.out
13068230
time used: 492848802 cycles

So, the program using the __builtin_expect is faster by
a factor of around 1.75, quite significant.

The problem is obvious from the assembly. The inner loop has, for
the version without the hint,

.L6:
        movq    %rcx, %rdx
.L4:
        movl    (%rdi,%rdx,4), %ecx
        cmpl    %ecx, %esi
        jge     .L3
        movl    %edx, %eax
        movl    %ecx, %esi
.L3:
        leaq    1(%rdx), %rcx
        cmpq    %rdx, %r8
        jne     .L6
.L1:
        ret

so what is likely to be the most common case (not finding a higher
maximum) leads to a jump over two instrucitons.  On the other hand,
the assembly for the version with the hint is

.L6:
        movq    %rcx, %rdx
.L4:
        movl    (%rdi,%rdx,4), %ecx
        cmpl    %ecx, %esi
        jl      .L8
.L3:
        leaq    1(%rdx), %rcx
        cmpq    %rdx, %r8
        jne     .L6
.L1:
        ret
        .p2align 4,,10
        .p2align 3
.L8:
        movl    %edx, %eax
        movl    %ecx, %esi
        jmp     .L3

so the control flow is not interrupted in the most common case.

The version with the hint becomes even faster when unrolling
the loop:

$ gcc -Ofast -funroll-loops b.c && ./a.out
13068230
time used: 254444202 cycles

another speed gain of around 15%.

This is for an AMD Ryzen 7 1700X.

Reply via email to