https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107718
Bug ID: 107718
Summary: clang optimizes TSVC s317 a lot better
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
This is a stupid benchmark but still...
jh@alberti:~/tsvc/bin> more tt2.c
typedef double real_t;
#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256
real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D];
real_t qq;
int
main(void)
{
  real_t q;
  for (int nl = 0; nl < 5*iterations; nl++) {
    q = (real_t)1.;
    for (int i = 0; i < LEN_1D/2; i++) {
      q *= (real_t).99;
    }
    qq+=q;
  }
  return q;
}
jh@alberti:~/tsvc/bin> time ./a.out
real 0m0.805s
user 0m0.805s
sys 0m0.000s
jh@alberti:~/tsvc/bin> clang -Ofast -march=native tt2.c
jh@alberti:~/tsvc/bin> time ./a.out
real 0m0.010s
user 0m0.007s
sys 0m0.003s
Clang does:
.LBB0_2:                                # Parent Loop BB0_1 Depth=1
                                        # => This Inner Loop Header: Depth=2
        vmulpd  %zmm2, %zmm3, %zmm3
        vmulpd  %zmm2, %zmm4, %zmm4
        vmulpd  %zmm2, %zmm5, %zmm5
        vmulpd  %zmm2, %zmm6, %zmm6
        addl    $-3200, %ecx            # imm = 0xF380
        jne     .LBB0_2
# %bb.3:                                # in Loop: Header=BB0_1 Depth=1
        vmulpd  %zmm3, %zmm4, %zmm3
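So clang runs four independent vector multiplication chains and, because of unrolling, apparently folds the repeated multiplications by the constant into a precomputed power (the loop counter steps by 3200 per iteration for only four vmulpd)?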
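
To make the guess concrete, here is a rough C sketch of the two transformations the assembly seems to suggest: splitting the serial product into independent partial products that are combined after the loop, and folding a run of multiplications by the constant into one multiplication by a precomputed power. The function names (product_serial, product_reassociated, product_folded), the LANES value and the step parameter are made up for illustration and are not taken from clang or GCC. Both rewrites change floating-point rounding, so a compiler would presumably only do this under -Ofast / -ffast-math.

```c
#include <assert.h>
#include <math.h>

/* Reference: the serial dependency chain from the inner loop of tt2.c. */
static double product_serial(double factor, int n)
{
    double q = 1.0;
    for (int i = 0; i < n; i++)
        q *= factor;
    return q;
}

/* Hypothetical number of independent accumulators (illustration only). */
enum { LANES = 4 };

/* Reassociated form: LANES independent partial products, so the
   multiplications no longer depend on each other, combined after the loop. */
static double product_reassociated(double factor, int n)
{
    assert(n % LANES == 0);
    double acc[LANES] = { 1.0, 1.0, 1.0, 1.0 };
    for (int i = 0; i < n; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] *= factor;
    return (acc[0] * acc[1]) * (acc[2] * acc[3]);
}

/* Folded form: every `step` multiplications by `factor` become a single
   multiplication by factor^step.  With a literal constant such as .99 the
   power could be computed at compile time; here it is computed at run time. */
static double product_folded(double factor, int n, int step)
{
    assert(step > 0 && n % step == 0);
    double factor_pow = product_serial(factor, step);
    double q = 1.0;
    for (int i = 0; i < n; i += step)
        q *= factor_pow;
    return q;
}
```

For the testcase, product_serial(.99, 16000) corresponds to the original inner loop; the other two variants should give the same value up to rounding while doing far fewer dependent multiplications.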