A simple function that just sums over a vector is much slower when it is inlined than when it is called out of line. The out-of-line version keeps the sum in an xmm register, while the inlined version loads and stores the stack variable on every iteration (this is a guess from reading the generated assembly).
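
The test case itself is not included in this message. As a rough sketch of the kind of code being described, the two variants might look like the following. The function names, the double element type, and the use of GCC's noinline and always_inline attributes to force each variant are all assumptions made for illustration, not the reporter's code.

```c
#include <stddef.h>

/* Hypothetical out-of-line variant: the noinline attribute asks GCC
 * to keep this as a real call, so the accumulator can live in a
 * register for the whole loop. */
__attribute__((noinline))
static double sum_outofline(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Hypothetical inlined variant: same loop, but always_inline forces
 * the body into the caller, which is where the report sees the sum
 * being reloaded from and stored to the stack on each iteration. */
static inline __attribute__((always_inline))
double sum_inline(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

Compiling a caller of both functions with gcc -S and comparing the two loops in the assembly output is one way to check whether the accumulator stays in a register.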

Timings on a 2.4 P4 Xeon:

out-of line:
T0: 3117.44 ms
T1: 653.93 ms

inline:
T0: 3097.05 ms
T1: 3104.18 ms

-- 
           Summary: Inline code performance much worse than out-of-line
           Product: gcc
           Version: 4.1.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jamagallon at ono dot com
GCC target triplet: i586-mandriva-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396