https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773
--- Comment #9 from PeteVine <tulipawn at gmail dot com> ---
It seems LPATHBench exhibits the same issue.

https://raw.githubusercontent.com/logicchains/LPATHBench/master/c_fast.c

compiled the following way:

gcc -falign-functions=32 -std=gnu99 -O2 -mcpu=cortex-a5 -fomit-frame-pointer -mfpu=neon -ftree-vectorize -ffast-math c_fast.c -o c_fast

is faster than a profiled version. (The average over 10 runs shows about a 4% slowdown for the profiled build.)

Once again, a division call is present in the profiled assembly:

bl __aeabi_idiv
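For readers unfamiliar with the symbol: __aeabi_idiv is the ARM EABI helper for signed 32-bit integer division, which GCC calls when the target core has no hardware divide instruction (as is the case with -mcpu=cortex-a5). The sketch below is not taken from c_fast.c; the function names are hypothetical and only illustrate which kind of source expression usually ends up as that call and which does not.

```c
/* Hypothetical illustration, not code from LPATHBench. */

/* Division by a value known only at run time. On an ARM core without
 * a hardware divide instruction, GCC typically lowers this to
 * "bl __aeabi_idiv". */
int quotient(int n, int d)
{
    return n / d;
}

/* Division by a compile-time constant. At -O2 GCC normally replaces
 * this with shifts and adds, so no library call is expected. */
int quotient_by_8(int n)
{
    return n / 8;
}
```

Compiling such a file with -S in addition to the -O2 -mcpu=cortex-a5 options used above, and searching the output for __aeabi_idiv, is one way to see which functions contain the call.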