https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172
--- Comment #2 from Seth LaForge <sethml at ofb dot net> ---

Good point on frame pointers vs a frame chain for unwinding. I'm looking for the unwindable frame chain.

Wilco:
> Why does this matter? Well as your examples show, if you want to emit a frame
> chain using standard push/pop, it typically ends up pointing to the top of the
> frame. That is the worst possible position for a frame pointer on Thumb - while
> Arm supports negative immediate offsets up to 4KB, Thumb-1 doesn't support
> negative offsets at all, and Thumb-2 supports offsets up to -255 but only with
> 32-bit instructions. So the result of conflating the frame chain and frame
> pointer implies a terrible codesize hit for Thumb.

Well, there's really no need for a frame pointer for efficiency, since the stack frame can be efficiently accessed with positive immediate accesses relative to the stack pointer. There are even special encodings for Thumb-2 16-bit LDR/STR which allow an immediate offset of 0 to 1020 when relative to SP - much larger than for other registers.

You're saying using a frame pointer implies a terrible codesize hit for Thumb, but I don't see how that can be - stack access will continue to go through SP, and the only overhead should be pushing/popping R7 (~2 cycles), computing R7 as a frame pointer (~1 cycle), and potential register spills due to one fewer register being available. That's a pretty small amount of overhead for a non-leaf function.

> Your examples suggest LLVM suffers from both of these issues, and IIRC it still
> uses r11 on Arm but r7 on Thumb. That is way too inefficient/incorrect to
> consider a defacto standard.

Using R11 in ARM and R7 on Thumb is mandated by the AAPCS I believe. I don't think the overhead is likely to be particularly different in Thumb vs ARM.

Numbers talk, so I collected some benchmarks on some production firmware used in self-driving cars.
This code is executing on a Cortex-R5F MCU, processing a large amount of data, with a wide variety of function sizes. Unfortunately precise benchmarking on this MCU is difficult - there seem to be swings of a few percent in performance due to changes in code alignment, but the rough results have been reliable. I'm collecting .text size and time spent in computation. Unfortunately we're using a pretty old version of gcc, but the frame pointer generation doesn't seem to have changed in newer releases.

Baseline: With gcc 4.7, -fomit-frame-pointer, -mthumb: 384016 bytes, 110.943 s.

With gcc 4.7, -fno-omit-frame-pointer, -mthumb: 396688 bytes, 113.539 s.
This shows a +3.3% size overhead and +2.3% time overhead for enabling frame pointers in Thumb-2 code.

With gcc 4.7, -fomit-frame-pointer, ARM mode: 487152 bytes, 113.874 s.
That's +26.9% size and +2.6% time over -mthumb.

With gcc 4.7, -fno-omit-frame-pointer, ARM mode: 498064 bytes, 116.936 s.
This shows a +2.2% size overhead and +2.7% time overhead for enabling frame pointers in ARM code.

Within margin of error, it appears the frame pointer overhead is comparable in Thumb-2 and ARM code.

With clang 7, -fomit-frame-pointer, -mthumb: 371008 bytes, 107.072 s.
That's -3.4% size and -3.5% time over gcc 4.7.

With clang 7, -fno-omit-frame-pointer, -mthumb: 377296 bytes, 110.868 s.
This shows a +1.7% size overhead and +3.5% time overhead for enabling frame pointers in Thumb-2 code for clang 7.

Within margin of error, it appears clang's frame pointer time overhead is slightly higher than gcc's for Thumb-2, but not by much.

With clang 7, -fomit-frame-pointer, ARM mode: 458592 bytes, 112.829 s.
That's +21.5% size and +1.8% time over clang -mthumb with frame pointers (+23.6% size and +5.4% time over clang -mthumb without them).

With clang 7, -fno-omit-frame-pointer, ARM mode: 463440 bytes, 111.796 s.
That's +1.1% size and -0.9% time over clang ARM without frame pointers.
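For anyone who wants to re-check the percentages from the raw byte counts and timings, here is a minimal C sketch of the arithmetic. The names pct_over, struct build, struct overhead and frame_pointer_overhead are made up for illustration and are not part of any toolchain; the only inputs are the measurements listed above.

```c
#include <assert.h>

/* Relative change of `value` over `base`, in percent.
 * Positive means `value` is larger (more bytes, more seconds). */
static double pct_over(double base, double value)
{
    assert(base > 0.0); /* sizes and times are always positive */
    return (value - base) / base * 100.0;
}

/* One benchmark row: .text size in bytes and run time in seconds. */
struct build {
    double text_bytes;
    double seconds;
};

/* Size and time overhead of one build relative to another. */
struct overhead {
    double size_pct;
    double time_pct;
};

/* Overhead of the build with frame pointers relative to the one without. */
static struct overhead frame_pointer_overhead(struct build without_fp,
                                              struct build with_fp)
{
    struct overhead o;
    o.size_pct = pct_over(without_fp.text_bytes, with_fp.text_bytes);
    o.time_pct = pct_over(without_fp.seconds, with_fp.seconds);
    return o;
}
```

Calling frame_pointer_overhead() from a small main() with the two gcc -mthumb rows (384016 bytes / 110.943 s and 396688 bytes / 113.539 s) gives the Thumb-2 frame pointer overhead for gcc; the same call works for any other pair of rows, provided both arguments really are the intended baseline and variant.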
I'm a bit mystified by that last result (clang in ARM mode running slightly faster with frame pointers than without) - I looked at the generated code and it does what I'd expect, so I think this is just benchmarking variation due to caches/alignment.

For my application, a ~2.5% performance hit is very worthwhile to gain the extra debuggability of easy stack traces. I'll probably end up switching over to clang and frame pointers.

It'd be nice if people using gcc for embedded ARM development had an easy option for generating stack traces.
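To make the "easy stack traces" point concrete, here is a minimal, host-runnable C sketch of what walking a frame chain looks like. It assumes a hypothetical layout in which the frame pointer points at a two-word record holding the caller's frame pointer followed by the return address; struct frame_record and walk_frame_chain are names invented for this sketch, and the real layout depends on the compiler and ABI. On a target, the starting pointer would come from R7 (Thumb) or R11 (ARM), and this only works if the compiler actually emits such a chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed (hypothetical) frame record: the frame pointer points here. */
struct frame_record {
    const struct frame_record *caller; /* caller's record, NULL at the outermost frame */
    uintptr_t return_address;          /* saved LR for this frame */
};

/* Follow the chain from `fp`, storing up to `max` return addresses in `out`.
 * Returns the number of addresses stored. */
static size_t walk_frame_chain(const struct frame_record *fp,
                               uintptr_t *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL || max == 0);
    while (fp != NULL && n < max) {
        out[n++] = fp->return_address;
        /* The stack grows down, so a caller's record must sit at a higher
         * address. Stop otherwise so a corrupted chain cannot loop forever. */
        if (fp->caller != NULL && (uintptr_t)fp->caller <= (uintptr_t)fp)
            break;
        fp = fp->caller;
    }
    return n;
}
```

The walk needs no unwind tables or debug info, just one load per frame, which is why a reliable frame chain is attractive on a small MCU. A real implementation would also want to check that each pointer lies within the current stack before dereferencing it.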