[Bug other/110946] 3x perf regression with -Os on M1 Pro

dave.rodgman at arm dot com via Gcc-bugs Tue, 08 Aug 2023 06:27:09 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946


--- Comment #6 from Dave Rodgman <dave.rodgman at arm dot com> ---
Under clang, whether or not mbedtls_xor is inlined causes an equivalent perf
difference. Note that mbedtls_xor is inlined in the gcc -O2 build and not in
the gcc -Os build.

mbedtls_xor not inlined, clang -Os:
  AES-XTS-128              :     834549 KiB/s,          0 cycles/byte
  AES-XTS-256              :     674383 KiB/s,          0 cycles/byte

mbedtls_xor inlined, clang -Os:
  AES-XTS-128              :    2664799 KiB/s,          0 cycles/byte
  AES-XTS-256              :    2278008 KiB/s,          0 cycles/byte


However, if I mark mbedtls_xor as static inline (actually, for testing
purposes, I created a static inline copy in aes.c), gcc still does not inline
it. I am not sure why. If I use "__attribute__((always_inline))" gcc will
inline it.

So it looks like gcc is overly averse to inlining this function, or is getting
the cost/benefit of inlining wrong here.
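
To illustrate the difference between the two markings discussed above, here is
a minimal sketch. The name xor_bytes and the XOR_FORCE_INLINE macro are
hypothetical stand-ins, not the actual Mbed TLS code. Plain "static inline" is
only a hint that the optimizer may decline, while the always_inline attribute
asks gcc and clang to inline regardless of their cost estimate.

```c
#include <stddef.h>

/* Hypothetical stand-in for mbedtls_xor; not the actual Mbed TLS source. */
#if defined(__GNUC__) || defined(__clang__)
/* always_inline overrides the optimizer's size/speed heuristics. */
#define XOR_FORCE_INLINE static inline __attribute__((always_inline))
#else
/* Fallback for other compilers: plain inline hint only. */
#define XOR_FORCE_INLINE static inline
#endif

/* Computes r[i] = a[i] ^ b[i] for every i in [0, n). */
XOR_FORCE_INLINE void xor_bytes(unsigned char *r,
                                const unsigned char *a,
                                const unsigned char *b,
                                size_t n)
{
    for (size_t i = 0; i < n; i++) {
        r[i] = (unsigned char) (a[i] ^ b[i]);
    }
}
```

Swapping XOR_FORCE_INLINE for plain "static inline" gives the variant that gcc
reportedly declined to inline at -Os in the test described above.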

In 3 of the 5 cases, n == 16 is known at compile time, so the function
compiles to four instructions:

    139c:       3dc00021        ldr     q1, [x1]
    13a0:       3dc00040        ldr     q0, [x2]
    13a4:       6e211c00        eor     v0.16b, v0.16b, v1.16b
    13a8:       3d800000        str     q0, [x0]

so it does seem surprising that gcc doesn't want to inline this.
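
As a rough sketch of why the n == 16 case is so small, consider a version with
the length fixed at 16 bytes. The helper name xor_block16 is hypothetical and
this is not the Mbed TLS implementation. With no loop bound left to evaluate,
the body is straight-line code that a compiler is free to lower to two vector
loads, one XOR and one store, as in the listing above; the exact output
depends on the compiler, target and flags.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: XOR two 16-byte blocks into r. */
static inline void xor_block16(unsigned char *r,
                               const unsigned char *a,
                               const unsigned char *b)
{
    uint64_t x[2], y[2];

    /* memcpy avoids alignment and aliasing problems on the byte pointers. */
    memcpy(x, a, 16);
    memcpy(y, b, 16);

    x[0] ^= y[0];
    x[1] ^= y[1];

    memcpy(r, x, 16);
}
```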
