https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #6 from Dave Rodgman <dave.rodgman at arm dot com> ---

Under clang, we see that mbedtls_xor being inlined, or not, causes an
equivalent perf difference. Note that mbedtls_xor is inline in the gcc O2
version and not in the gcc Os version.

Not inline mbedtls_xor, -Os clang:
  AES-XTS-128 : 834549 KiB/s, 0 cycles/byte
  AES-XTS-256 : 674383 KiB/s, 0 cycles/byte

Inline mbedtls_xor, -Os clang:
  AES-XTS-128 : 2664799 KiB/s, 0 cycles/byte
  AES-XTS-256 : 2278008 KiB/s, 0 cycles/byte

However, if I mark mbedtls_xor as static inline (actually, for testing
purposes, I created a static inline copy in aes.c), gcc still does not inline
it. I am not sure why. If I use "__attribute__((always_inline))" gcc will
inline it.

So it looks like gcc is overly averse to inlining this function, or is getting
the cost/benefit of inlining wrong here? For 3/5 cases, we know at compile
time that n == 16, so the function will compile to four instructions:

    139c: 3dc00021  ldr q1, [x1]
    13a0: 3dc00040  ldr q0, [x2]
    13a4: 6e211c00  eor v0.16b, v0.16b, v1.16b
    13a8: 3d800000  str q0, [x0]

so it does seem surprising that gcc doesn't want to inline this.
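
For anyone wanting to reproduce the experiment described above, here is a minimal sketch of the pattern. It is not the Mbed TLS source: xor_bytes, xor_block16 and the FORCE_INLINE macro are hypothetical names made up for illustration, and the byte-wise loop is an assumed stand-in for whatever mbedtls_xor actually does. The only thing taken from the comment is the use of __attribute__((always_inline)) and a caller that passes a compile-time constant n == 16.

```c
#include <stddef.h>

/* Hypothetical macro: force inlining on GCC/clang, plain static inline
   elsewhere. Dropping the attribute leaves the decision to the inliner. */
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Hypothetical stand-in for mbedtls_xor: r[i] = a[i] ^ b[i] for i < n.
   This is an illustrative byte-wise loop, not the Mbed TLS implementation. */
FORCE_INLINE void xor_bytes(unsigned char *r, const unsigned char *a,
                            const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        r[i] = (unsigned char)(a[i] ^ b[i]);
    }
}

/* Caller where n is known at compile time (16, one AES block). Once
   xor_bytes is inlined here, the compiler is free to collapse the loop
   into a single vector load/xor/store sequence. */
void xor_block16(unsigned char r[16], const unsigned char a[16],
                 const unsigned char b[16])
{
    xor_bytes(r, a, b, 16);
}
```

Building this with -Os, once with and once without the attribute, and comparing the disassembly of xor_block16 should show whether the call survives. GCC's -Winline option may also help, since it warns when a function declared inline is not inlined.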