https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740
Bug ID: 85740 Summary: Non-optimal determining maximum in a loop Product: gcc Version: unknown Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: tkoenig at gcc dot gnu.org Target Milestone: --- There is a problem with loops determining the maximum (or minimum) of a value. I noticed this when looking at inline code generated by gfortran for maxloc/minloc. Consider the two programs which select the position of a maximum from a rather large integer array. They are identical, except that b.c uses __builtin_expect in the inner loop: $ cat b.c #include <stdio.h> #include <stdlib.h> #include <time.h> #include <x86intrin.h> #define N (1<<28) int foo(int *a, int n) { int m, nm; int i; m = -20000; nm = -1; for (i=0; i<n; i++) { if (__builtin_expect (a[i] > m, 0)) { m = a[i]; nm = i; } } return nm; } int main(int argc, char **argv) { unsigned long long t1, t2; int *p = malloc(N * sizeof(int)); for (int i=0; i<N; i++) p[i] = random(); t1 = __rdtsc(); int res = foo(p, N); t2 = __rdtsc(); printf("%d\n", res); printf("time used: %llu cycles\n", t2-t1); return 0; } $ cat c.c #include <stdio.h> #include <stdlib.h> #include <time.h> #include <x86intrin.h> #define N (1<<28) int foo(int *a, int n) { int m, nm; int i; m = -20000; nm = -1; for (i=0; i<n; i++) { if (a[i] > m) { m = a[i]; nm = i; } } return nm; } int main(int argc, char **argv) { unsigned long long t1, t2; int *p = malloc(N * sizeof(int)); for (int i=0; i<N; i++) p[i] = random(); t1 = __rdtsc(); int res = foo(p, N); t2 = __rdtsc(); printf("%d\n", res); printf("time used: %llu cycles\n", t2-t1); return 0; } $ gcc -Ofast b.c && ./a.out 13068230 time used: 281813216 cycles $ gcc -Ofast c.c && ./a.out 13068230 time used: 492848802 cycles So, the program using the __builtin_expect is faster by a factor of around 1.75, quite significant. The problem is obvious from the assembly. The inner loop has, for the version without the hint, .L6: movq %rcx, %rdx .L4: movl (%rdi,%rdx,4), %ecx cmpl %ecx, %esi jge .L3 movl %edx, %eax movl %ecx, %esi .L3: leaq 1(%rdx), %rcx cmpq %rdx, %r8 jne .L6 .L1: ret so what is likely to be the most common case (not finding a higher maximum) leads to a jump over two instrucitons. On the other hand, the assembly for the version with the hint is .L6: movq %rcx, %rdx .L4: movl (%rdi,%rdx,4), %ecx cmpl %ecx, %esi jl .L8 .L3: leaq 1(%rdx), %rcx cmpq %rdx, %r8 jne .L6 .L1: ret .p2align 4,,10 .p2align 3 .L8: movl %edx, %eax movl %ecx, %esi jmp .L3 so the control flow is not interrupted in the most common case. The version with the hint becomes even faster when unrolling the loop: $ gcc -Ofast -funroll-loops b.c && ./a.out 13068230 time used: 254444202 cycles another speed gain of around 15%. This is for an AMD Ryzen 7 1700X.