[Bug target/81614] New: x86 optimizer combines results of comparisons in a way that risks partial register stalls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614

            Bug ID: 81614
           Summary: x86 optimizer combines results of comparisons in a way
                    that risks partial register stalls
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cody at codygray dot com
  Target Milestone: ---
            Target: i?86-*-*

Consider the following code:

bool foo(int a, int b, int c)
{
    // It doesn't matter if this short-circuits ('||' vs. '|')
    // because the optimizer treats them as equivalent.
    return (a == c || b == c);
}

All versions of GCC (going back to at least 4.4.7 and forward to the current
8.0 preview) translate this to the following optimized assembly on x86
targets:

foo(int, int, int):
        movl    12(%esp), %edx
        cmpl    %edx, 4(%esp)
        sete    %al
        cmpl    8(%esp), %edx
        sete    %dl
        orl     %edx, %eax
        ret

The problem here is the second-to-last instruction. It ORs together two full
32-bit registers, even though the preceding SETE instructions set only the
low 8 bits of each register. On virtually all x86 processors, this results
in a speed-zapping phenomenon called a *partial register stall*. (See
http://www.agner.org/optimize/microarchitecture.pdf for details on exactly
how this is a performance problem on the various implementations of x86.
Although there are differences in exactly *why* it is a speed penalty, it
virtually always is one, and it *certainly* should be treated as one when
the output is tuned for a generic x86 target.)

You get the same results at all optimization levels, including -Os (at
least, the relevant portion of the code is the same). You also see this for
x86-64 targets:

foo(int, int, int):
        cmpl    %edx, %edi
        sete    %al
        cmpl    %esi, %edx
        sete    %dl
        orl     %edx, %eax
        ret

One of two things should be done instead: either (A) perform the bitwise
operation *only* on the low bytes, or (B) pre-zero the entire 32-bit
register *before* setting its low byte, to break dependencies.

Proposed Resolution A (use only low bytes):

foo(int, int, int):
        movl    12(%esp), %edx
        cmpl    %edx, 4(%esp)
        sete    %al
        cmpl    8(%esp), %edx
        sete    %dl
        orb     %dl, %al
        ret

Proposed Resolution B (pre-zero to break dependencies):

foo(int, int, int):
        movl    12(%esp), %edx
        xorl    %eax, %eax
        cmpl    %edx, 4(%esp)
        sete    %al
        xorl    %ecx, %ecx
        cmpl    8(%esp), %edx
        sete    %cl
        orl     %ecx, %eax
        ret

Approach A is the one used by Clang and MSVC. It solves the partial register
stall while avoiding the need for the third register that Approach B
requires. The disadvantage of Approach A is that it produces only a
byte-sized (8-bit) result. That is perfectly fine when the function returns
a bool, but it does not work when the function returns an integer type.
There are two ways to solve that. What GCC currently does, if you change
foo() to return int, is add a MOVZBL instruction between the OR and the RET:

foo(int, int, int):
        movl    12(%esp), %edx
        cmpl    %edx, 4(%esp)
        sete    %al
        cmpl    8(%esp), %edx
        sete    %dl
        orl     %edx, %eax
        movzbl  %al, %eax
        ret

This zero-extends the result in AL into EAX. (Notice that the partial
register stall hazard is still there.) This existing behavior could simply
be maintained. However, it would be better still to pre-zero as shown in
Approach B. (For details on why pre-zeroing is preferable on all x86
microarchitectures, see https://stackoverflow.com/a/33668295.)
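To make the int-returning case discussed above concrete, here is a minimal
sketch (the name foo_int is mine, used only to distinguish it from the
bool-returning foo in the report):

// Hypothetical int-returning variant of foo(). With the pre-zeroing of
// Approach B, the full 32-bit value in EAX is already defined by the time
// the OR executes, so the trailing MOVZBL becomes unnecessary.
int foo_int(int a, int b, int c)
{
    return (a == c || b == c);
}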
[Bug target/81614] Should -mtune-ctrl=partial_reg_stall be turned on by default?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614

--- Comment #3 from Cody Gray ---

(In reply to Uroš Bizjak from comment #1)
> Partial register stalls were discussed many times in the past, but
> apparently the compiler still produces fastest code when partial register
> stalls are enabled on latest target processors (e.g. -mtune=intel).

I don't understand what that means. -mtune=intel does *not* fix the partial
register stall problem. It should. All Intel CPUs prior to Haswell would
absolutely experience partial register stalls on this code, resulting in a
performance degradation.

-mtune-ctrl=partial_reg_stall does produce the correct code, but I wasn't
aware of this option, and I believe I shouldn't have to be. If a developer
gets sub-optimal code even when he asks the compiler to tune for his
specific microarchitecture, then the optimizer has a bug.

This is not an issue where there are arguments on either side. There is
absolutely no benefit to generating the code that the compiler currently
does. ORing the BYTE-sized registers takes the same number of bytes as ORing
the DWORD-sized registers, while the former runs faster on the vast majority
of CPUs and is no slower on the rest.

> Also, it is hard to confirm tuning PRs without hard benchmark data.

No, it really isn't. I know that's a canned response, likely brought about
by hard-won experience with a lot of dubious "tuning" feature requests, but
it's a cop-out in this case, if not outright dismissive. Partial register
stalls are a well-documented phenomenon, confirmed by multiple sources, and
have been a significant source of performance degradation since the Pentium
Pro was released circa 1995. Agner Fog's manuals, cited above, are the
authoritative reference for performance tuning on x86, and they confirm this
in spades. In fact, I would argue that an accurate conceptual understanding
of the microarchitecture is often a better guide than one-off
microbenchmarks, since the latter are so difficult to craft and therefore so
often misleading. For example, the effects of the stall might be masked by
the overhead of the function call, but when the code is inlined, or
*certainly* when it executes within an inner loop, there will be a
significant performance degradation.

Again, if I were proposing to bloat the code for a small payoff in speed, I
could see how you might be skeptical. But there is literally no downside to
making this change.

You could possibly argue that -mtune-ctrl=partial_reg_stall should not be
turned on when tuning for Haswell and later microarchitectures, since
Haswell was the first to eliminate the visible penalty for reading a full
32-bit register after writing a partial 8-bit "view" of that same register.
However, that applies *only* to the low-byte registers (AL, CL, DL, etc.).
With the high-byte registers (AH, CH, DH, etc.), there is still a loss in
performance, because an extra µop has to be inserted between the write to
the 8-bit register and the read from the 32-bit register. This increases the
latency by one clock cycle, so unless the xH partial registers are treated
differently from the xL partial registers, applying the optimizations
described above would still be a performance win, especially since there is
no drawback.
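For anyone who wants to see the difference locally, something along these
lines should do it (a sketch only; it assumes the test case from the
original report is saved as foo.cpp and that the toolchain can emit 32-bit
code):

g++ -m32 -O2 -S -o - foo.cpp
g++ -m32 -O2 -mtune-ctrl=partial_reg_stall -S -o - foo.cpp

The first invocation should show the 32-bit OR discussed above; the second
should show the form that avoids reading the full register after the partial
write.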
[Bug target/81274] New: x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

            Bug ID: 81274
           Summary: x86 optimizer emits unnecessary LEA instruction when
                    using AVX intrinsics
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cody at codygray dot com
  Target Milestone: ---
            Target: i?86-*-*

When AVX intrinsics are used in a function, the x86-32 optimizer emits an
unnecessary LEA instruction that clobbers a register, forcing that register
to be preserved at additional expense.

Test Code:
----------

#include <immintrin.h>

__m256 foo(const float *x)
{
    __m256 ymmX = _mm256_load_ps(&x[0]);
    return _mm256_addsub_ps(ymmX, ymmX);
}

Compile with: "-m32 -mtune=generic -mavx -O2"

This also reproduces at -O1 and -O3, and when tuning for any architecture
that supports AVX (it is not specific to the "generic" target). It also does
not matter whether the code is compiled in C or C++ mode. The behavior is
exhibited by *all* versions of GCC that support AVX targeting, from at least
4.9.0 through 8.0.0 (20170701). The code compiles warning-free, of course.

See it live on Godbolt: https://godbolt.org/g/NDDgsA

Actual Disassembly:
-------------------

foo:    # -O2 or -O3
        pushl   %ecx
        movl    8(%esp), %eax
        leal    8(%esp), %ecx
        vmovaps (%eax), %ymm0
        popl    %ecx
        vaddsubps %ymm0, %ymm0, %ymm0
        ret

The LEA instruction redundantly computes the address of the parameter's
stack slot into ECX, and that value is then promptly discarded. The load of
ECX also has spill-over effects, requiring additional code to preserve the
original value of the register (the PUSH+POP pair).

The same bug is observed at -O1, but the ordering of the instructions is
slightly different, and the address in ECX is actually used to load EAX,
further lengthening the dependency chain for no benefit whatsoever:

foo:    # -O1
        pushl   %ecx
        leal    8(%esp), %ecx
        movl    (%ecx), %eax
        vmovaps (%eax), %ymm0
        vaddsubps %ymm0, %ymm0, %ymm0
        popl    %ecx
        ret

Expected Disassembly:
---------------------

foo:
        movl    8(%esp), %eax
        vmovaps (%eax), %ymm0
        vaddsubps %ymm0, %ymm0, %ymm0
        ret

Or better yet:

foo:
        vmovaps 8(%esp), %ymm0
        vaddsubps %ymm0, %ymm0, %ymm0
        ret

The correct code shown above is already generated for x86-64 builds (-m64),
so this optimization deficiency affects only x86-32 builds (-m32).
[Bug c++/81392] New: Improve diagnostics for [[fallthrough]] attribute that is missing a semicolon
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81392

            Bug ID: 81392
           Summary: Improve diagnostics for [[fallthrough]] attribute that
                    is missing a semicolon
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: diagnostic
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cody at codygray dot com
  Target Milestone: ---

The C++1z/C++17 draft standard introduces a [[fallthrough]] attribute to
explicitly document that fall-through behavior is intended in a switch-case
block. This works in conjunction with G++'s -Wimplicit-fallthrough option,
which warns about potentially unintended fall-through.

The [[fallthrough]] attribute is required to be applied to an empty
statement (see
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0188r1.pdf), and
therefore requires a terminating semicolon. However, forgetting that
semicolon is a common error. With the following code:

#include <iostream>

int main()
{
    switch (0)
    {
    case 0:
        std::cout << "a\n";
        [[fallthrough]]
    case 1:
        std::cout << "b\n";
        break;
    }
}

G++ (7.1 and the current trunk of 8.0) issues the following warning:

warning: this statement may fall through [-Wimplicit-fallthrough=]
         std::cout << "a\n";
                   ~~^~~~

This is less helpful than it could be. The current Clang trunk provides a
substantially more helpful error message in this case:

error: fallthrough attribute is only allowed on empty statements
        [[fallthrough]]
        ^
note: did you forget ';'?
        [[fallthrough]]
                       ^
                       ;

It would be nice to have something similar in G++.
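For completeness, here is what the corrected source would look like once the
missing semicolon is added, so that the attribute applies to a null
statement (this corrected snippet is mine, not part of the original report):

#include <iostream>

int main()
{
    switch (0)
    {
    case 0:
        std::cout << "a\n";
        [[fallthrough]];   // note the terminating semicolon: the attribute
                           // now applies to an empty statement
    case 1:
        std::cout << "b\n";
        break;
    }
}

With the semicolon in place, both the G++ warning and the Clang error should
disappear.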
[Bug target/81456] New: x86-64 optimizer makes wrong decision when optimizing for size
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81456

            Bug ID: 81456
           Summary: x86-64 optimizer makes wrong decision when optimizing
                    for size
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cody at codygray dot com
  Target Milestone: ---
            Target: x86_64

Consider the following code (it doesn't matter whether you compile it as C
or C++):

#include <stdlib.h>

int Bounce(int a, int b)
{
    int mod = abs(a % b);
    return (a/b % 2) ? b-mod : mod;
}

When optimizing for speed (whether -O1, -O2, or -O3), this is compiled to
the following x86-64 machine code:

Bounce(int, int):
        movl    %edi, %eax
        cltd
        idivl   %esi
        movl    %edx, %ecx
        sarl    $31, %ecx
        xorl    %ecx, %edx
        subl    %ecx, %edx
        subl    %edx, %esi
        testb   $1, %al
        cmovne  %esi, %edx
        movl    %edx, %eax
        ret

That output is observed as far back as (at least) GCC 4.8, and all the way
up to the current 8.0 preview I have (8.0.0 20170716). However, when
optimizing for size (-Os), the same function produces this output:

Bounce:
        movl    %edi, %eax
        cltd
        idivl   %esi
        movl    %edx, %ecx
        sarl    $31, %ecx
        xorl    %ecx, %edx
        subl    %ecx, %edx
        testb   $1, %al
        je      .L1
        subl    %edx, %esi
        movl    %esi, %edx
.L1:
        movl    %edx, %eax
        ret

This defies expectations because it is actually *larger* (more bytes) than
the version optimized for speed. The JE instruction in this version is 2
bytes, as is each MOVL instruction, making that section 6 bytes total. In
the version optimized for speed, the CMOVNE instruction is 3 bytes, plus a
2-byte MOVL, for 5 bytes total. (The SUBL instruction is required either
way.)

Now, one byte obviously isn't a big deal in terms of total size, except that
the CMOV version is also faster, so even if the two versions were exactly
the same size, it should be preferred over the branching version! (The
optimizer has no reason to suspect that the quotient of the division will be
predictably odd or even, so a non-branching conditional move is the right
choice for the best worst-case performance.)

This is a regression relative to GCC 6.3; in other words, GCC 6.3 generates
the same code at -Os as at -O1/-O2/-O3. I don't have GCC 7.0 available, so
GCC 7.1 is the earliest version I have that reproduces the described
behavior, and it is still present in GCC 8.

This also is *not* observed when targeting 32-bit x86: there you get
conditional moves whenever the target architecture supports them (P6 and
later). So this affects only x86-64, where conditional moves are *always*
available.