https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #6 from Dave Rodgman <dave.rodgman at arm dot com> ---
Under clang, whether or not mbedtls_xor is inlined causes an equivalent perf
difference (roughly 3.2x to 3.4x in the numbers below). Note that mbedtls_xor
is inlined in the gcc O2 version and not in the gcc Os version.
mbedtls_xor not inlined, -Os clang:
AES-XTS-128 : 834549 KiB/s, 0 cycles/byte
AES-XTS-256 : 674383 KiB/s, 0 cycles/byte
mbedtls_xor inlined, -Os clang:
AES-XTS-128 : 2664799 KiB/s, 0 cycles/byte
AES-XTS-256 : 2278008 KiB/s, 0 cycles/byte
However, if I mark mbedtls_xor as static inline (actually, for testing
purposes, I created a static inline copy in aes.c), gcc still does not inline
it. I am not sure why. If I use "__attribute__((always_inline))" gcc will
inline it.
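
To make the two variants concrete, here is a minimal sketch of what is being compared. The name xor_bytes and the FORCE_INLINE macro are hypothetical stand-ins for illustration; this is not the Mbed TLS source, only a byte-wise XOR helper with a similar shape. With plain "static inline" the compiler is free to decline to inline; the always_inline attribute (supported by both gcc and clang) removes that choice.

```c
#include <stddef.h>

/* Hypothetical macro: force inlining on gcc/clang, fall back to a plain
 * "static inline" hint elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Hypothetical stand-in for the helper under discussion:
 * r[i] = a[i] ^ b[i] for i in [0, n). */
FORCE_INLINE void xor_bytes(unsigned char *r, const unsigned char *a,
                            const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        r[i] = (unsigned char) (a[i] ^ b[i]);
    }
}
```

Replacing FORCE_INLINE with "static inline" gives the variant that gcc -Os declined to inline in my test.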
So it looks like gcc is overly averse to inlining this function, or is getting
the cost/benefit of inlining wrong here.
In 3 of the 5 cases, we know at compile time that n == 16, so the function will
compile to four instructions:
139c: 3dc00021 ldr q1, [x1]
13a0: 3dc00040 ldr q0, [x2]
13a4: 6e211c00 eor v0.16b, v0.16b, v1.16b
13a8: 3d800000 str q0, [x0]
so it does seem surprising that gcc doesn't want to inline this.
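
As an illustration of the n == 16 case, here is a minimal sketch. The names xor_bytes and xor_block16 are hypothetical, not taken from Mbed TLS. Once the helper is inlined into a caller that passes a constant length, the compiler can see n == 16 and may reduce the loop to a pair of 128-bit loads, one vector XOR and a store, as in the listing above. Whether it actually does so depends on the compiler, target and flags.

```c
#include <stddef.h>

/* Hypothetical generic helper: r[i] = a[i] ^ b[i] for i in [0, n). */
static inline void xor_bytes(unsigned char *r, const unsigned char *a,
                             const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        r[i] = (unsigned char) (a[i] ^ b[i]);
    }
}

/* Hypothetical caller with a compile-time constant length. If xor_bytes
 * is inlined here, the trip count is known to be 16 and the loop can be
 * vectorised without any runtime length check. */
void xor_block16(unsigned char out[16], const unsigned char a[16],
                 const unsigned char b[16])
{
    xor_bytes(out, a, b, 16);
}
```

Comparing the disassembly of xor_block16 at -O2 and -Os (for example with objdump -d) is one way to check whether the call was inlined.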