The multiply-by-constant optimization work poorly for x86 targets with multiply units with short latency.
Try some small values M for a sample program like, long f (long x) { return x * M; } for e.g. the -mtune=k8 subtarget. One gets slow sequences of lea, add, sub, sal. So what's wrong? Is it expmed.c's synth_mult that doesn't measure costs correctly? Or does i386.c provide poor cost measures? I'd say that synth_mult's cost model is great for 3 operand machines, but not very good for x86. It does not see the additional mov instructions inserted in many of its most clever sequences, nor does it understand that small shifts are done with the 2 cycle lea instruction (when src!=dst). Additionally, synth_mult thinks a 10 operation long sequence that cost 999 is the way to go if a single mult costs 1000. It does not take sequence length into account at all. Perhaps it should? Let's look at some examples of generated code for -mtune=k8. 6: (11 bytes, >= 3 cycles) leaq (%rdi,%rdi), %rax salq $3, %rdi subq %rax, %rdi 10: (12 bytes, >= 4 cycles) leaq 0(,%rdi,8), %rax leaq (%rax,%rdi,2), %rax 11: (21 bytes, >= 4 cycles) leaq 0(,%rdi,4), %rdx movq %rdi, %rax salq $4, %rax subq %rdx, %rax subq %rdi, %rax 13: (21 bytes, >= 4 cycles) leaq 0(,%rdi,4), %rdx movq %rdi, %rax salq $4, %rax subq %rdx, %rax addq %rdi, %rax etc, etc. The cycle counts are only if we get ideal parallel execution, otherwise one additional cycle will be needed. The imul instruction needs 4 cycles (64-bit operation) and is not alignment sensitive and needs just one execution slot. It can therefore execute simultaneously with independent instructions, while the above sequences will use up much decode, execute, and retire resources. What can be done about this? The simple fix is to pretend multiplication is cheaper: --- /u/gcc/gcc-4.2.2/gcc/config/i386/.~/i386.c.~1~ Sat Sep 1 17:28:30 2007 +++ /u/gcc/gcc-4.2.2/gcc/config/i386/i386.c Thu Dec 13 10:12:07 2007 @@ -17254,7 +17254,7 @@ op0 = XEXP (op0, 0), mode = GET_MODE (op0); } - *total = (ix86_cost->mult_init[MODE_INDEX (mode)] + *total = (ix86_cost->mult_init[MODE_INDEX (mode)] - 1 + nbits * ix86_cost->mult_bit + rtx_cost (op0, outer_code) + rtx_cost (op1, outer_code)); This avoids most of the problems, but we still get a 4 cycle, 2 lea sequence for M = 10. Potential problem: This might affect other parts of the optimizer than synth_mult. That might be bad, but it might also be desirable. Another fix would perhaps be to teach synth_mult to understand that it's generating code for a 2.5 operand machine (one that can only do "a x= b", not "a = b x c", for some operation x). We should teach it that there will be moves inserted for sequences that rely on a source register twice (more or less). Letting synth_mult take sequence length into account would also make sense, I think. A cost of 1 per operation does not seem unreasonable. (I wrote synth_mult originally.) -- Summary: Multiply-by-constant pessimation Product: gcc Version: 4.2.2 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: tege-gcc at swox dot com GCC target triplet: i386-*-* http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34452