https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172

--- Comment #2 from Seth LaForge <sethml at ofb dot net> ---
Good point on frame pointers vs a frame chain for unwinding. I'm looking for
the unwindable frame chain.

Wilco:
> Why does this matter? Well as your examples show, if you want to emit a frame
> chain using standard push/pop, it typically ends up pointing to the top of the
> frame. That is the worst possible position for a frame pointer on Thumb - while
> Arm supports negative immediate offsets up to 4KB, Thumb-1 doesn't support
> negative offsets at all, and Thumb-2 supports offsets up to -255 but only with
> 32-bit instructions. So the result of conflating the frame chain and frame
> pointer implies a terrible codesize hit for Thumb.

Well, there's really no need for a frame pointer for efficiency, since the
stack frame can be efficiently accessed with positive immediate offsets
relative to the stack pointer. There are even special 16-bit Thumb encodings
of LDR/STR which allow an immediate offset of 0 to 1020 when relative to SP -
much larger than for other registers. You're saying using a frame pointer
implies a terrible codesize hit for Thumb, but I don't see how that can be -
stack access will continue to go through SP, and the only overhead should be
pushing/popping R7 (~2 cycles), computing R7 as a frame pointer (~1 cycle), and
potential register spills due to one fewer register being available. That's a
pretty small amount of overhead for a non-leaf function.
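
To illustrate what an unwindable frame chain buys you, here is a minimal
Python sketch that simulates walking one. The record layout is an assumption
made for illustration only (saved caller frame pointer at [fp], return address
at [fp + 4], a saved value of 0 ending the chain), and the function name and
toy memory image are hypothetical. It is not a description of what gcc or
clang actually emit.

```python
# Simulated walk of a frame chain over a toy memory image.
# Assumed (hypothetical) record layout: [fp] = caller's frame pointer,
# [fp + 4] = return address, and a saved frame pointer of 0 ends the chain.

def walk_frame_chain(memory, fp, max_depth=64):
    """Return the return addresses found by following saved frame pointers."""
    trace = []
    for _ in range(max_depth):
        # Stop on a null or unreadable frame record.
        if fp == 0 or fp not in memory or fp + 4 not in memory:
            break
        trace.append(memory[fp + 4])
        next_fp = memory[fp]
        # The stack grows down, so a caller's record must sit at a higher
        # address; anything else means a corrupt or looping chain.
        if next_fp != 0 and next_fp <= fp:
            break
        fp = next_fp
    return trace

# Toy stack image: address -> 32-bit word.
memory = {
    0x2000FFE0: 0x2000FFF0, 0x2000FFE4: 0x08000123,  # innermost frame
    0x2000FFF0: 0x2000FFF8, 0x2000FFF4: 0x08000457,
    0x2000FFF8: 0x00000000, 0x2000FFFC: 0x080000A1,  # outermost frame
}

print([hex(addr) for addr in walk_frame_chain(memory, 0x2000FFE0)])
```

The walker needs no unwind tables, only the current frame pointer and readable
stack memory, which is what makes this approach attractive on small targets.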

> Your examples suggest LLVM suffers from both of these issues, and IIRC it still
> uses r11 on Arm but r7 on Thumb. That is way too inefficient/incorrect to
> consider a defacto standard.

Using R11 in ARM and R7 on Thumb is mandated by the AAPCS I believe. I don't
think the overhead is likely to be particularly different in Thumb vs ARM.

Numbers talk, so I collected some benchmarks on some production firmware used
in self-driving cars. This code is executing on a Cortex-R5F MCU, processing a
large amount of data, with a wide variety of function sizes. Unfortunately
precise benchmarking on this MCU is difficult - there seem to be swings of a
few percent in performance due to changes in code alignment, but the rough
results have been reliable. I'm collecting .text size and time spent in
computation. Unfortunately we're using a pretty old version of gcc, but the
frame pointer generation doesn't seem to have changed in newer releases.

Baseline: With gcc 4.7, -fomit-frame-pointer, -mthumb: 384016 bytes, 110.943 s.
With gcc 4.7, -fno-omit-frame-pointer, -mthumb: 396688 bytes, 113.539 s.
This shows a +3.3% size overhead and +2.3% time overhead for enabling frame
pointers in Thumb-2 code.

With gcc 4.7, -fomit-frame-pointer, ARM mode: 487152 bytes, 113.874 s.
That's +26.9% size and +2.6% time over -mthumb.
With gcc 4.7, -fno-omit-frame-pointer, ARM mode: 498064 bytes, 116.936 s.
This shows a +2.2% size overhead and +2.7% time overhead for enabling frame
pointers in ARM code.
Within margin of error, it appears the frame pointer overhead is comparable in
Thumb-2 and ARM code.

With clang 7, -fomit-frame-pointer, -mthumb: 371008 bytes, 107.072 s.
That's -3.4% size and -3.5% time over gcc 4.7.
With clang 7, -fno-omit-frame-pointer, -mthumb: 377296 bytes, 110.868 s.
This shows a +1.7% size overhead and +3.5% time overhead for enabling frame
pointers in Thumb-2 code for clang 7.
Within margin of error, it appears clang's frame pointer overhead is slightly
higher than gcc's for Thumb-2, but not much.

With clang 7, -fomit-frame-pointer, ARM mode: 458592 bytes, 112.829 s.
That's +23.6% size and +5.4% time over clang -mthumb without frame pointers
(+21.5% size and +1.8% time over clang -mthumb with frame pointers).
With clang 7, -fno-omit-frame-pointer, ARM mode: 463440 bytes, 111.796 s.
That's +1.1% size -0.9% time over clang ARM without frame pointers. I'm a bit
mystified by this result - I looked at the generated code and it does what I'd
expect, so I think this is just benchmarking variation due to caches/alignment.
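
For anyone wanting to check the percentages, the following Python sketch
recomputes the relative changes from the byte counts and timings quoted above.
The dictionary keys and the helper name are made up for this example; only the
numbers come from the measurements in this comment.

```python
# Recompute relative overheads from the quoted (.text bytes, seconds) pairs.
# Labels and helper names are illustrative, not part of any tool.

def pct_change(new, base):
    """Percentage change of `new` relative to `base`, rounded to 0.1%."""
    return round(100.0 * (new - base) / base, 1)

runs = {
    "gcc thumb":      (384016, 110.943),
    "gcc thumb fp":   (396688, 113.539),
    "gcc arm":        (487152, 113.874),
    "gcc arm fp":     (498064, 116.936),
    "clang thumb":    (371008, 107.072),
    "clang thumb fp": (377296, 110.868),
    "clang arm":      (458592, 112.829),
    "clang arm fp":   (463440, 111.796),
}

def overhead(name, baseline):
    """Return (size %, time %) of run `name` relative to run `baseline`."""
    size, secs = runs[name]
    base_size, base_secs = runs[baseline]
    return pct_change(size, base_size), pct_change(secs, base_secs)

# Frame pointer overhead within each compiler/mode pair.
for base in ("gcc thumb", "gcc arm", "clang thumb", "clang arm"):
    size_pct, time_pct = overhead(base + " fp", base)
    print(f"{base}: size {size_pct:+.1f}%, time {time_pct:+.1f}%")
```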

For my application, a ~2.5% performance hit is very worthwhile to gain the
extra debuggability of easy stack traces. I'll probably end up switching over to
clang and frame pointers. It'd be nice if people using gcc for embedded ARM
development had an easy option for generating stack traces.
