Consider the following code:

int (*indirect_func)();
int indirect_call() { return indirect_func(); }

gcc 4.4.0 generates the following with -O2 -mcpu=cortex-a8 -S:

indirect_call:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        movw    r3, #:lower16:indirect_func
        stmfd   sp!, {r4, lr}
        movt    r3, #:upper16:indirect_func
        mov     lr, pc
        ldr     pc, [r3, #0]
        ldmfd   sp!, {r4, pc}

The problem is that the instruction "ldr pc, [r3, #0]" is not considered a
function call by the Cortex-A8's branch predictor, as noted in DDI0344J
section 5.2.1, Return stack predictions. Thus, the return from the called
function is mispredicted, resulting in a penalty of 13 cycles compared to a
direct call.

Rather than doing

        mov     lr, pc
        ldr     pc, [r3]

it should instead use the blx instruction, like so:

        ldr     lr, [r3]
        blx     lr

which is considered a function call by the branch predictor, and has an
overhead of only one cycle compared to a direct call.

gcc -v:
Using built-in specs.
Target: arm-none-linux-gnueabi
Configured with: ../gcc-4.4.0/configure --target=arm-none-linux-gnueabi --prefix=/usr/local/arm --enable-threads --with-sysroot=/usr/local/arm/arm-none-linux-gnueabi/libc
Thread model: posix
gcc version 4.4.0 (GCC)

--
           Summary: GCC generates suboptimal code for indirect function calls
                    on ARM
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: lessen42 at gmail dot com
  GCC host triplet: i386-apple-darwin
GCC target triplet: arm-none-linux-gnueabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40887
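
For anyone trying to reproduce this, a slightly fuller variant of the testcase might look like the sketch below. It is not part of the original report: the callee `callee` and the baseline `direct_call` are hypothetical names added for illustration, and `noinline` is a GCC attribute used here only so the direct call is not optimized away. Compiling it with the same flags (-O2 -mcpu=cortex-a8 -S) should let the two call sequences be compared side by side.

```c
/* Hypothetical callee (not in the original testcase); noinline keeps
   GCC from folding the direct call below into a constant. */
__attribute__((noinline)) int callee(void) { return 42; }

/* Same pointer as in the report, but with an explicit prototype and an
   initializer so the testcase can also be linked and run. */
int (*indirect_func)(void) = callee;

/* The call under discussion: it goes through the function pointer. */
int indirect_call(void) { return indirect_func(); }

/* Baseline for comparison: an ordinary direct call to the same callee. */
int direct_call(void) { return callee(); }
```

Both functions return the same value, so any difference in the generated assembly comes only from how the call itself is emitted.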