https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740
Bug ID: 85740
Summary: Non-optimal determining maximum in a loop
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: tkoenig at gcc dot gnu.org
Target Milestone: ---
There is a problem with loops determining the maximum (or minimum)
of a value. I noticed this when looking at inline code generated
by gfortran for maxloc/minloc.
Consider the two programs which select the position of a
maximum from a rather large integer array. They are identical,
except that b.c uses __builtin_expect in the inner loop:
$ cat b.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <x86intrin.h>
#define N (1<<28)
int foo(int *a, int n)
{
int m, nm;
int i;
m = -20000;
nm = -1;
for (i=0; i<n; i++)
{
if (__builtin_expect (a[i] > m, 0))
{
m = a[i];
nm = i;
}
}
return nm;
}
int
main(int argc, char **argv) {
unsigned long long t1, t2;
int *p = malloc(N * sizeof(int));
for (int i=0; i<N; i++)
p[i] = random();
t1 = __rdtsc();
int res = foo(p, N);
t2 = __rdtsc();
printf("%d\n", res);
printf("time used: %llu cycles\n", t2-t1);
return 0;
}
$ cat c.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <x86intrin.h>
#define N (1<<28)
int foo(int *a, int n)
{
int m, nm;
int i;
m = -20000;
nm = -1;
for (i=0; i<n; i++)
{
if (a[i] > m)
{
m = a[i];
nm = i;
}
}
return nm;
}
int
main(int argc, char **argv) {
unsigned long long t1, t2;
int *p = malloc(N * sizeof(int));
for (int i=0; i<N; i++)
p[i] = random();
t1 = __rdtsc();
int res = foo(p, N);
t2 = __rdtsc();
printf("%d\n", res);
printf("time used: %llu cycles\n", t2-t1);
return 0;
}
$ gcc -Ofast b.c && ./a.out
13068230
time used: 281813216 cycles
$ gcc -Ofast c.c && ./a.out
13068230
time used: 492848802 cycles
So, the program using the __builtin_expect is faster by
a factor of around 1.75, quite significant.
The problem is obvious from the assembly. The inner loop has, for
the version without the hint,
.L6:
movq %rcx, %rdx
.L4:
movl (%rdi,%rdx,4), %ecx
cmpl %ecx, %esi
jge .L3
movl %edx, %eax
movl %ecx, %esi
.L3:
leaq 1(%rdx), %rcx
cmpq %rdx, %r8
jne .L6
.L1:
ret
so what is likely to be the most common case (not finding a higher
maximum) leads to a jump over two instrucitons. On the other hand,
the assembly for the version with the hint is
.L6:
movq %rcx, %rdx
.L4:
movl (%rdi,%rdx,4), %ecx
cmpl %ecx, %esi
jl .L8
.L3:
leaq 1(%rdx), %rcx
cmpq %rdx, %r8
jne .L6
.L1:
ret
.p2align 4,,10
.p2align 3
.L8:
movl %edx, %eax
movl %ecx, %esi
jmp .L3
so the control flow is not interrupted in the most common case.
The version with the hint becomes even faster when unrolling
the loop:
$ gcc -Ofast -funroll-loops b.c && ./a.out
13068230
time used: 254444202 cycles
another speed gain of around 15%.
This is for an AMD Ryzen 7 1700X.