[Bug rtl-optimization/92712] New: Performance regression with assumed values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92712

            Bug ID: 92712
           Summary: Performance regression with assumed values
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mike.k at digitalcarbide dot com
  Target Milestone: ---

The following code generates progressively worse code from GCC 7.5 to GCC 8.3
to GCC 9.1 (and trunk):

    static void func_base(int t, const int v) {
        int x = 0;
        for (int i = 0; i < t; ++i) {
            x += v;
        }
        volatile int d = x;
    }

    void func_default(int t, const int v) {
        func_base(t, v);
    }

    void func_assumed(int t, const int v) {
        if (t < 0) __builtin_unreachable();
        func_base(t, v);
    }

On GCC 7.5 (-O2):

    func_default(int, int):
            test    edi, edi
            jle     .L3
            imul    edi, esi
            mov     DWORD PTR [rsp-4], edi
            ret
    .L3:
            xor     edi, edi
            mov     DWORD PTR [rsp-4], edi
            ret
    func_assumed(int, int):
            imul    edi, esi
            mov     DWORD PTR [rsp-4], edi
            ret

On GCC 8.3 (-O2):

    func_default(int, int):
            test    edi, edi
            jle     .L3
            imul    edi, esi
            mov     DWORD PTR [rsp-4], edi
            ret
    .L3:
            xor     edi, edi
            mov     DWORD PTR [rsp-4], edi
            ret
    func_assumed(int, int):
            test    edi, edi
            je      .L6
            imul    edi, esi
    .L6:
            mov     DWORD PTR [rsp-4], edi
            ret

On GCC 9.1 and trunk (-O2):

    func_default(int, int):
            test    edi, edi
            jle     .L3
            sub     edi, 1
            imul    edi, esi
            add     esi, edi
            mov     DWORD PTR [rsp-4], esi
            ret
    .L3:
            xor     esi, esi
            mov     DWORD PTR [rsp-4], esi
            ret
    func_assumed(int, int):
            test    edi, edi
            je      .L6
            sub     edi, 1
            imul    edi, esi
            add     edi, esi
    .L6:
            mov     DWORD PTR [rsp-4], edi
            ret

This occurs regardless of whether `func_base` is allowed to inline or is
manually inlined. It does not occur in LLVM-Clang or in Microsoft Visual C++.
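For reference, the loop in `func_base` adds `v` to `x` exactly `t` times when `t` is positive and not at all otherwise, which is why the GCC 7.5 output can reduce it to a single `imul`. Below is a minimal sketch of that equivalence in C. The names `loop_sum` and `closed_form` are hypothetical and not part of the report, and the inputs are assumed small enough that the sum does not overflow.

```c
/* Hypothetical helper: the loop from func_base, returning x instead of
   storing it to a volatile. */
int loop_sum(int t, int v)
{
    int x = 0;
    for (int i = 0; i < t; ++i) {
        x += v;
    }
    return x;
}

/* Hypothetical helper: the closed form of the loop above.
   For t <= 0 the loop body never runs, so the result is 0. */
int closed_form(int t, int v)
{
    return (t > 0) ? t * v : 0;
}
```

When `t < 0` is declared unreachable, as in `func_assumed`, the only value left outside the `t > 0` case is `t == 0`, and `t * v` is already 0 there, so no branch is needed.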
[Bug rtl-optimization/93605] New: GCC suboptimal tail call optimization in trivial function forwarding with __attribute__((noinline))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93605

            Bug ID: 93605
           Summary: GCC suboptimal tail call optimization in trivial function
                    forwarding with __attribute__((noinline))
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mike.k at digitalcarbide dot com
  Target Milestone: ---

In a trivial function-forwarder where `__attribute__((noinline))` is specified
on the forwardee, an extra `movzx` instruction is generated (on x86-64) prior
to the tail call. This does not occur on Clang.

Observe (https://godbolt.org/z/kFGCpW):

```
namespace impl {
    __attribute__((noinline))
    static int func (bool v, int a, int b) {
        return v ? a/b : b/a;
    }
}

int func(bool v, int a, int b) {
    return impl::func(v, a, b);
}
```

On all tested versions (trunk (10) to GCC 4), this produces the following
assembly for `func`:

```
func(bool, int, int):
        movzx   edi, dil
        jmp     impl::func(bool, int, int)
```

On Clang trunk (10) until Clang 5.0.0, this produces the following assembly for
`func`:

```
func(bool, int, int):                   # @func(bool, int, int)
        jmp     impl::func(bool, int, int) # TAILCALL
```

Clang 5.0.0 and below produce identical assembly to GCC:

```
func(bool, int, int):                   # @func(bool, int, int)
        movzx   edi, dil
        jmp     impl::func(bool, int, int) # TAILCALL
```
[Bug target/93605] GCC suboptimal tail call optimization in trivial function forwarding with __attribute__((noinline))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93605

--- Comment #2 from mike.k at digitalcarbide dot com ---
Interestingly, changing `impl::func`'s signature from `bool v` to `auto&& v`
fixes the issue. Changing it to `auto v` does not.
[Bug middle-end/91459] New: Tail-Call Optimization is not performed when return value is assumed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91459

            Bug ID: 91459
           Summary: Tail-Call Optimization is not performed when return
                    value is assumed.
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mike.k at digitalcarbide dot com
  Target Milestone: ---

In situations where a function either returns a specific value or does not
return at all, GCC fails to perform tail call optimizations. This appears to
occur on all GCC versions with -O1, -O2, -O3, and -Os. It occurs with both the
C and C++ front-ends.

Observe:

    /* This function is guaranteed to only return the value '1', else it does
       not return. This is meant to emulate a function such as 'exec'. */
    extern int function_returns_only_1_or_doesnt_return(int, int);

    int foo1(int a, int b) {
        const int result = function_returns_only_1_or_doesnt_return(a, b);
        if (result == 1) {
            return result;
        }
        else {
            __builtin_unreachable();
        }
    }

    int foo2(int a, int b) {
        return function_returns_only_1_or_doesnt_return(a, b);
    }

This results in the following output for -O3 on x86-64:

    foo1(int, int):
            push    rax
            call    function_returns_only_1_or_doesnt_return(int, int)
            mov     eax, 1
            pop     rdx
            ret
    foo3(int, int):
            jmp     function_returns_only_1_or_doesnt_return(int, int)

While the generated code is correct, the tail call is more efficient and
preserves the same semantics.

The same behavior occurs with other architectures as well, so it does not
appear to be a back-end issue.
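The claim that both forms preserve the same semantics can be checked with a stand-in callee. The sketch below is an illustration under stated assumptions, not the reporter's test case: `stub_returns_1`, `foo1_sketch`, and `foo2_sketch` are hypothetical names, and the stub always returns 1 so that the unreachable branch is never taken.

```c
/* Hypothetical stand-in for function_returns_only_1_or_doesnt_return;
   it always returns 1, the only value the real function may return. */
static int stub_returns_1(int a, int b)
{
    (void)a;
    (void)b;
    return 1;
}

/* Same shape as foo1 in the report: any result other than 1 is
   declared unreachable. */
int foo1_sketch(int a, int b)
{
    const int result = stub_returns_1(a, b);
    if (result == 1) {
        return result;
    }
    __builtin_unreachable(); /* GCC/Clang builtin */
}

/* Same shape as foo2 in the report: a plain forwarding call. */
int foo2_sketch(int a, int b)
{
    return stub_returns_1(a, b);
}
```

Because the stub is visible in the same translation unit, a compiler may inline it. To inspect the call sequence itself, the callee would need to stay external, as in the report.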
[Bug middle-end/91459] Tail-Call Optimization is not performed when return value is assumed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91459

--- Comment #1 from mike.k at digitalcarbide dot com ---
'foo3' in the assembly output should be 'foo2'. I'd changed the function name
in my test code and did not update the assembly. Apologies.
[Bug c++/82658] New: Suboptimal codegen on AVR when right-shifting 8-bit unsigned integers.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82658

            Bug ID: 82658
           Summary: Suboptimal codegen on AVR when right-shifting 8-bit
                    unsigned integers.
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mike.k at digitalcarbide dot com
  Target Milestone: ---

This issue has been confirmed at least as far back as 5.4.0, and still occurs
in trunk.

When shifting an unsigned char/uint8_t right by less than 4 bits, suboptimal
code is generated. This behavior only occurs when compiling source files as
C++, not as C, even when the source file is otherwise equivalent. The issue
does not manifest with left shifts or with larger composite types (such as
uint16_t).

Trivial test:

    void test ()
    {
        volatile unsigned char val;
        unsigned char local = val;
        local >>= 1;
        val = local;
    }

Compiling as C++ (avr-g++ [-O3|-O2] -mmcu=atmega2560 test.cpp -S -c -o test.s)
results in the following assembly sequence handling the load, shift, and store:

    ldd r24,Y+1
    ldi r25,0
    asr r25
    ror r24
    std Y+1,r24

The next operation performed on r25 is a clr. Thus, ldi/asr/ror are entirely
equivalent to lsr in this situation, which is what the C frontend does:

Compiling as C (avr-gcc [-O3|-O2] -mmcu=atmega2560 test.c -S -c -o test.s)
results in the following assembly sequence handling the load, shift, and store:

    ldd r24,Y+1
    lsr r24
    std Y+1,r24

This is optimal code. This is also the defined behavior in avr.c.

The issue becomes more problematic with larger shifts (up until 4, where the
defined behavior takes over again), as it generates the same instruction
sequence repeatedly, whereas the C front end simply generates 'lsr; lsr; lsr',
as expected.

Interestingly, the issue does _not_ manifest if one chooses to use an integer
division instead of a shift: if one divides the unsigned char by 2 instead of
shifting right by 1, it emits 'lsr' as expected.
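The shift and the division mentioned above compute the same value for every 8-bit unsigned input, which is why one would expect the same instruction for both. A small sketch in C that checks this exhaustively follows. The function names are hypothetical and only illustrate the source-level equivalence. They say nothing about which instructions a given compiler emits.

```c
/* Hypothetical helper: right shift by one, as in the report's test. */
unsigned char shift_right_1(unsigned char v)
{
    return (unsigned char)(v >> 1);
}

/* Hypothetical helper: the integer-division variant from the report. */
unsigned char divide_by_2(unsigned char v)
{
    return (unsigned char)(v / 2);
}

/* Returns 1 if both helpers agree on all 256 possible inputs, else 0. */
int shift_matches_division(void)
{
    for (unsigned v = 0; v <= 255u; ++v) {
        if (shift_right_1((unsigned char)v) != divide_by_2((unsigned char)v)) {
            return 0;
        }
    }
    return 1;
}
```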
[Bug middle-end/82658] Suboptimal codegen on AVR when right-shifting 8-bit unsigned integers.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82658

--- Comment #2 from mike.k at digitalcarbide dot com ---
I wanted to check whether this issue also shows up in the toolchains for other
architectures, so I tested a bit:

GCC 7.2.0 on x86-64 (-O3):

C:
        movzx   eax, BYTE PTR [rsp-1]
        shr     al
        mov     BYTE PTR [rsp-1], al
        ret

C++:
        movzx   eax, BYTE PTR [rsp-1]
        sar     eax
        mov     BYTE PTR [rsp-1], al
        ret

While not different in performance, it _is_ generating different code, and the
code difference seems to reflect what Richard already found.

I am not able to reproduce any difference on MIPS64, MIPS32, ARM, ARM64, PPC,
PPC64. This is probably due to backend differences not causing the sequences to
map differently.

I do see it going back to GCC 4.6.4 on AVR.
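The two x86-64 sequences store the same byte because `movzx` zero-extends the loaded value: the 32-bit register then holds a non-negative number, so an arithmetic shift (`sar eax`) cannot bring in a set sign bit. A minimal sketch of that equivalence in C follows. The function names are hypothetical, and the code models the arithmetic only, not the compiler's internal representation.

```c
#include <stdint.h>

/* Models `shr al`: a logical shift of the 8-bit value. */
uint8_t shift_logical(uint8_t v)
{
    return (uint8_t)((unsigned)v >> 1);
}

/* Models `movzx` followed by `sar eax`: widen to a signed 32-bit value
   (always non-negative here), shift, then keep the low byte. */
uint8_t shift_arithmetic_widened(uint8_t v)
{
    int32_t wide = v;
    return (uint8_t)(wide >> 1);
}

/* Returns 1 if both models agree on all 256 possible inputs, else 0. */
int shifts_agree(void)
{
    for (unsigned v = 0; v <= 255u; ++v) {
        if (shift_logical((uint8_t)v) != shift_arithmetic_widened((uint8_t)v)) {
            return 0;
        }
    }
    return 1;
}
```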