[Bug c/92172] New: ARM Thumb2 frame pointers inconsistent with clang and ARM-THUMB Procedure Call Standard

sethml at ofb dot net Mon, 21 Oct 2019 16:03:50 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172


            Bug ID: 92172
           Summary: ARM Thumb2 frame pointers inconsistent with clang and
                    ARM-THUMB Procedure Call Standard
           Product: gcc
           Version: 8.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: sethml at ofb dot net
  Target Milestone: ---

This is a bit of a feature request, which has been rejected before, but I think
there are compelling reasons to reconsider.

The issue is described pretty well in this gcc-patches thread:
https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg195725.html

And in this clang bug:
https://bugs.llvm.org/show_bug.cgi?id=18505

The request is to provide an option to make gcc's frame pointer behavior
consistent with clang, either with a special flag, or by default.

The behavior of frame pointers on ARM is a mess, with AAPCS not defining it,
the obsolete ARM-Thumb Procedure Call Standard (ATPCS) recommending a frame
layout different from both GCC's and clang's, and ARM's obsolete armcc compiler
implementing different semantics.

However, as of 2014, ARM's standard toolchain is "ARM Compiler 6", which
packages clang:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.comp6/index.html

The Keil embedded toolchain, which is pretty industry-standard for ARM embedded
development, uses armclang:
http://www.keil.com/support/man/docs/armclang_ref/armclang_ref_vvi1466179578564.htm

Addressing some of the objections to modifying the frame layout from the
gcc-patches thread:

Wilco Dijkstra
<https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg195782.html>:
> However changing the frame pointer like in the proposed patch
> will have a much larger cost - both in performance and codesize. You'd be 
> lucky if it is less than 10%. This is due to placing the frame pointer at the
> top rather than the bottom of the frame, and that is very inefficient in 
> Thumb-2.

I don't understand this objection. For a simple function the additional
overhead is literally nothing - for example <https://godbolt.org/z/BhvM2t>, GCC
generates:
        push    {r3, r4, r7, lr}
        add     r7, sp, #0
while clang adds a small constant to make r7 point to the previous r7 on the
stack, with lr immediately above - zero overhead:
        push    {r4, r6, r7, lr}
        add     r7, sp, #8
For a more complex function where the compiler has to spill r8-r11 one extra
instruction is required to generate the right frame layout - gcc generates:
        push    {r3, r4, r5, r6, r7, r8, r9, lr}
        add     r7, sp, #0
While clang generates:
        push    {r4, r5, r6, r7, lr}
        add     r7, sp, #12
        push.w  {r8, r9, r11}
Push (stmdb) instructions take, at least on Cortex-M3, 1+N cycles, where N is
the number of registers saved. So clang's frame pointer approach takes one
extra cycle and 4 extra bytes.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337e/BABBCJII.html
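
As a rough check of that arithmetic, here is a minimal sketch in C. It assumes
only the 1+N cycle figure quoted above and the register lists from the two
prologues shown; the function names are hypothetical, and the add instruction
is left out because both prologues contain one.

```c
#include <assert.h>

/* Cycle cost of a single push, using the 1+N figure for Cortex-M3. */
unsigned push_cycles(unsigned n_regs)
{
    assert(n_regs > 0);   /* a push always saves at least one register */
    return 1u + n_regs;
}

/* gcc: one push of {r3-r9, lr}, i.e. 8 registers. */
unsigned gcc_push_cycles(void)
{
    return push_cycles(8);
}

/* clang: push {r4-r7, lr} (5 registers), then push.w {r8, r9, r11} (3). */
unsigned clang_push_cycles(void)
{
    return push_cycles(5) + push_cycles(3);
}
```

Both compilers save eight registers in total, so the only difference under
this model is the fixed cost of the second push instruction.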

> Doing real unwinding is also far more accurate than frame pointer based
> unwinding (the latter doesn't handle leaf functions correctly, entry/exit in
> non-leaf functions and shrinkwrapped functions - and this breaks callgraph
> profiling).

This is true, but doing real unwinding is prohibitively expensive in an
embedded systems context, in which one has only hundreds of KiB of code storage
and RAM.

Richard Earnshaw
<https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg196444.html>:
> I object to another hack going in for another ill-specified frame
> pointer variant until such time as the ABI is updated to sort this out
> properly.
>
> So until the ABI sanctions a proper inter-function frame chain record,
> GCC will only support local use of the frame pointer and no chaining.

Since this is not defined by the ABI, the ABI is unlikely to specify it any
time soon. However, ARM seems to have blessed clang as the official ARM
compiler, so it's a de facto standard at this point.

Richard Earnshaw
<https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg196488.html>:
> On entry to a function the code has to save the existing frame register.
> It doesn't know (can't trivially know) whether the caller is code
> compiled in Arm state or Thumb state.  So how can it save the caller's
> frame register if they are not the same?
>
> Furthermore, the 'other' frame register (ie r7 in Arm state, r11 in
> Thumb) is available as a call-saved register, so can contain any random
> value.  If you try to use that random value during a frame chain walk
> your program will most like take an access violation.  It will certainly
> give you a garbage frame chain.

This is true - you cannot safely walk the stack frames if Thumb and ARM
functions are intermixed. However, for the situations in which this feature is
most useful this is not a problem. For deeply embedded codebases, the entire
codebase is compiled with a single compiler and instruction set. Most
microcontrollers use Cortex-M cores, which implement only Thumb instructions,
so ARM-state code cannot be present at all!

Someone wrote something like:
> The extra overhead of frame pointers will remove the benefit of thumb
> instructions - why not just use ARM instructions?

As noted above, there exist many MCUs for which ARM mode is not implemented.

I have two applications that motivate me to want this fixed. I'm working on
safety-critical firmware running on small microcontrollers.

1) In case of a crash, it would be extremely helpful to be able to have the
embedded firmware relay back a simple stack trace. Integrating libunwind and
including the unwind tables in our firmware is too heavyweight. We know the
boundaries of the stack, so it's easy to validate addresses when traversing
frames. If the stack trace sometimes ends early due to issues such as ARM/Thumb
interworking, we don't mind - it's much better than no trace at all.

2) It would be really helpful to have random sampling profiling, by capturing
stack traces from a randomly triggered timer interrupt handler. Full profiling
would add excessive overhead.
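
To make both use cases concrete, here is a minimal sketch in C of the kind of
frame walk I have in mind. It assumes the clang-style layout described earlier
(the word at r7 holds the caller's r7, with the saved lr in the word
immediately above) and known stack bounds. The function name, parameters, and
the choice to pass the stack as a word array (so the sketch can also be
exercised on a host) are all hypothetical, not taken from any toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of any toolchain.
 * Assumed frame record: the word at fp holds the caller's fp, and the
 * word at fp + 4 holds the saved lr.
 *
 * stack points at the word stored at address stack_lo; on the target this
 * would simply be (const uint32_t *)stack_lo. stack_hi is one past the end.
 * Returns the number of return addresses written to trace.
 */
size_t walk_frames(const uint32_t *stack, uint32_t stack_lo, uint32_t stack_hi,
                   uint32_t fp, uint32_t *trace, size_t max_depth)
{
    assert(stack != NULL && trace != NULL);
    size_t depth = 0;

    if (stack_hi < stack_lo || stack_hi - stack_lo < 8u)
        return 0;   /* no room for even one frame record */

    while (depth < max_depth) {
        /* Reject misaligned or out-of-bounds frame pointers. */
        if ((fp & 3u) != 0 || fp < stack_lo || fp > stack_hi - 8u)
            break;

        const uint32_t *rec = stack + (fp - stack_lo) / 4u;
        uint32_t next_fp = rec[0];
        trace[depth++] = rec[1];    /* saved lr = return address */

        /* Callers live at higher addresses; anything else ends the walk
         * (this also prevents looping on a corrupt chain). */
        if (next_fp <= fp)
            break;
        fp = next_fp;
    }
    return depth;
}
```

A crash handler or a sampling timer interrupt would call this with the r7
value at the point of interest and relay the resulting addresses. If the chain
is broken, the walk just stops early, which is the "better than no trace at
all" behavior described above.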

I'm totally willing to take a slight performance hit to get the two features
above. Judging from stackoverflow questions and such, there are others who
would like predictable frame pointers:
https://stackoverflow.com/questions/19643047/arm-call-stack-generation-with-no-frame-pointer
http://cplusadd.blogspot.com/2008/11/frame-pointers-and-function-call.html
https://gcc-help.gcc.gnu.narkive.com/D8BDrQzp/stack-backtrace-for-arm-thumb
https://github.com/google/sanitizers/issues/640
