https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773
--- Comment #12 from PeteVine <tulipawn at gmail dot com> ---
It even reproduces the following way: I built an instrumented ARMv7 binary natively, ran it on a Cortex-A53, copied the gcda file back, recompiled with -fprofile-use, and got the same 20% slowdown. Surely that must count (pun intended) for something, as both CPUs are in-order designs.