Package: gcc-3.4
Version: 3.4.1-7.0.0.1.amd64
Severity: wishlist

compiling this function:

double baz(double foo, double bar)
{
    return foo*foo*foo*foo*bar*bar*bar*bar;
}

on amd64 with -O6 -ffast-math, gcc emits this code:

foo.o:     file format elf64-x86-64

Disassembly of section .text:

... (some similar functions that I was messing around with) ...

0000000000000050 <ddbar>:
  50:   f2 0f 59 c0             mulsd  %xmm0,%xmm0
  54:   f2 0f 59 c0             mulsd  %xmm0,%xmm0
  58:   f2 0f 59 c1             mulsd  %xmm1,%xmm0
  5c:   f2 0f 59 c1             mulsd  %xmm1,%xmm0
  60:   f2 0f 59 c1             mulsd  %xmm1,%xmm0
  64:   f2 0f 59 c1             mulsd  %xmm1,%xmm0
  68:   c3                      retq

So, it notices that it can do foo*foo*foo*foo with two mulsd
instructions, but it misses the same optimization for
bar*bar*bar*bar. It would save one FP multiply overall to do:

        mulsd  %xmm0, %xmm0
        mulsd  %xmm1, %xmm1
        mulsd  %xmm0, %xmm0
        mulsd  %xmm1, %xmm1
        mulsd  %xmm1, %xmm0
        retq

Also, the two non-dependent muls could run in parallel.

Without -ffast-math, of course, gcc can't take advantage of the laws
of arithmetic like that and has to do all the multiplies the
straightforward way.

Anyway, that's what I noticed while poking around waiting for pure64
to download from alioth to this fancy new dual Opteron I'm setting up
for one of my users :)

Correct me if I'm all wrong about this optimization being possible...

-- System Information:
Debian Release: 3.1
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.8-1-amd64-k8-smp
Locale: LANG=C, LC_CTYPE=C

Versions of packages gcc-3.4 depends on:
ii  binutils      2.15-1                     The GNU assembler, linker and bina
ii  cpp-3.4       3.4.1-7.0.0.1.amd64        The GNU C preprocessor
ii  gcc-3.4-base  3.4.1-7.0.0.1.amd64        The GNU Compiler Collection (base
ii  libc6         2.3.2.ds1-16.0.0.1.amd64   GNU C Library: Shared libraries an
ii  libgcc1       1:3.4.1-7.0.0.1.amd64      GCC support library

-- no debconf information
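
For reference, here is the regrouping suggested in this report written out by hand in C. This is only a sketch of the arithmetic, not a claim about how gcc should implement it; the names baz_straight and baz_regrouped are made up for illustration. The first function is the original expression, which has seven multiplies in source order. The second uses five, split into two squaring chains that do not depend on each other until the final multiply.

```c
/* Hypothetical names, for illustration only. */

/* The expression as written in the report: seven multiplies,
   evaluated left to right when reassociation is not allowed. */
double baz_straight(double foo, double bar)
{
    return foo*foo*foo*foo*bar*bar*bar*bar;
}

/* Regrouped by hand as (foo^2)^2 * (bar^2)^2: five multiplies.
   The foo chain and the bar chain are independent until the end. */
double baz_regrouped(double foo, double bar)
{
    double f2 = foo * foo;   /* foo^2 */
    double b2 = bar * bar;   /* bar^2 */
    double f4 = f2 * f2;     /* foo^4 */
    double b4 = b2 * b2;     /* bar^4 */
    return f4 * b4;          /* foo^4 * bar^4 */
}
```

The two versions round at different points, so for general inputs the results may differ in the last bits. That is why the compiler is only allowed to make this change itself under -ffast-math.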