[Bug target/81614] New: x86 optimizer combines results of comparisons in a way that risks partial register stalls

2017-07-30 Thread cody at codygray dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614

Bug ID: 81614
   Summary: x86 optimizer combines results of comparisons in a way
that risks partial register stalls
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: cody at codygray dot com
  Target Milestone: ---
Target: i?86-*-*

Consider the following code:

bool foo(int a, int b, int c)
{
    // It doesn't matter if this short-circuits ('||' vs. '|')
    // because the optimizer treats them as equivalent.
    return (a == c || b == c);
}

All versions of GCC (going back to at least 4.4.7 and forward to the current
8.0 preview) translate this to the following optimized assembly on x86 targets:

foo(int, int, int):
        movl    12(%esp), %edx
        cmpl    %edx, 4(%esp)
        sete    %al
        cmpl    8(%esp), %edx
        sete    %dl
        orl     %edx, %eax
        ret

The problem here is the second-to-last instruction. It ORs together two full
32-bit registers, even though the preceding SETE instructions set only the low
8 bits of each register. Reading a full register immediately after writing only
part of it triggers a performance hazard, known as a *partial register stall*,
on virtually all x86 processors.

(See http://www.agner.org/optimize/microarchitecture.pdf for details on exactly
how this is a performance problem on various implementations of x86. Although
there are differences in exactly *why* it is a speed penalty, it virtually
always is and *certainly* should be considered one when the output is tuned for
a generic x86 target.)

You get the same results at all optimization levels, including -Os (at least,
the relevant portion of the code is the same). You also see this for x86-64
targets:

foo(int, int, int):
        cmpl    %edx, %edi
        sete    %al
        cmpl    %esi, %edx
        sete    %dl
        orl     %edx, %eax
        ret

One of two things should be done instead: either (A) perform the bitwise
operation *only* on the low bytes, or (B) pre-zero the entire 32-bit register
*before* setting its low byte to break dependencies.

Proposed Resolution A (use only low bytes):

foo(int, int, int):
        movl    12(%esp), %edx
        cmpl    %edx, 4(%esp)
        sete    %al
        cmpl    8(%esp), %edx
        sete    %dl
        orl     %dl, %al
        ret

Proposed Resolution B (pre-zero to break dependencies):

foo(int, int, int):
        movl    12(%esp), %edx
        xorl    %eax, %eax
        cmpl    %edx, 4(%esp)
        sete    %al
        xorl    %ecx, %ecx
        cmpl    8(%esp), %edx
        sete    %cl
        orl     %ecx, %eax
        ret

Approach A is the one used by Clang and MSVC. It solves the problem of partial
register stalls while avoiding the need for a third register as in Approach B.

The disadvantage of Approach A is that it creates only a byte-sized (8-bit)
result. This is perfectly fine if the function returns a bool, but doesn't work
if the function returns an integer type. There are two ways to solve that. What
GCC currently does if you change foo() to return int is add a MOVZBL
instruction between the OR and RET:

foo(int, int, int):
        movl    12(%esp), %edx
        cmpl    %edx, 4(%esp)
        sete    %al
        cmpl    8(%esp), %edx
        sete    %dl
        orl     %edx, %eax
        movzbl  %al, %eax
        ret

This zero-extends the result in AL into EAX. (Notice that the partial register
stall hazard is still there.) This existing behavior could simply be
maintained. However, pre-zeroing as shown in Approach B would be faster. (For
details on why pre-zeroing wins on every x86 microarchitecture, see
https://stackoverflow.com/a/33668295.)
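Concretely, the int-returning variant being discussed is just the same test
case with a different return type; a minimal sketch (nothing else changes):

int foo(int a, int b, int c)
{
    // Identical logic to the bool version above; the wider return type is
    // what forces GCC to zero-extend the byte result (the MOVZBL).
    return (a == c || b == c);
}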

[Bug target/81614] Should -mtune-ctrl=partial_reg_stall be turned on by default?

2017-07-30 Thread cody at codygray dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614

--- Comment #3 from Cody Gray  ---
(In reply to Uroš Bizjak from comment #1)
> Partial register stalls were discussed many times in the past, but
> apparently the compiler still produces fastest code when partial register
> stalls are enabled on latest target processors (e.g. -mtune=intel).

I don't understand what that means. -mtune=intel does *not* fix the partial
register stall problem. It should. All Intel CPUs prior to Haswell would
absolutely experience partial register stalls on this code, resulting in a
performance degradation.

-mtune-ctrl=partial_reg_stall does get the correct code, but I wasn't aware of
this option and I believe I shouldn't have to be. If a developer is getting
sub-optimal code even when he is asking the compiler to tune for his specific
microarchitecture, then the optimizer has a bug.
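(For anyone else who runs into this: the workaround is simply to pass that
tuning flag explicitly. Assuming the bool-returning test case above is saved as
foo.cpp, an invocation along the lines of

g++ -m32 -O2 -mtune-ctrl=partial_reg_stall -S foo.cpp

emits the byte-wise OR rather than the full-register OR.)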

This is not an issue where there are arguments on either side. There is
absolutely no benefit to generating the code that the compiler currently does.
It is the same number of bytes to OR the BYTE-sized registers as it is to OR
the DWORD-sized registers, while the former will run faster on the vast
majority of CPUs and won't be any slower on the others.

> Also, it is hard to confirm tuning PRs without hard benchmark data.

No, it really isn't. I know that's a canned response, likely brought about by
hard-won experience with a lot of dubious "tuning" feature requests, but it's
just a cop-out in this case, if not outright dismissive. Partial register
stalls are a well-documented phenomenon, confirmed by multiple sources, and
have been a significant source of performance degradation since the Pentium Pro
was released circa 1995.

Agner Fog's manuals, as cited above, are really the authoritative reference
when it comes to performance tuning on x86, and they provide confirmation of
this in spades. In fact, I would argue that an accurate conceptual
understanding of the microarchitecture is often a better guide than one-off
microbenchmarks, since the latter are so difficult to craft and therefore so
often misleading. For example, the effects of the stall might be masked by the
overhead of the function call, but when the code is inlined or *certainly* when
it is executed within an inner loop, there will be a significant performance
degradation.

Again, if this were an issue where I was proposing bloating the size of the
code for a small payoff in speed, I could see how you might be skeptical. But
there is literally no downside to making this change.

You could possibly argue that -mtune-ctrl=partial_reg_stall should not be
turned on when tuning for Haswell and later microarchitectures, as Haswell was
the first to alleviate the visible performance penalties associated with
reading from a full 32-bit register after writing to a partial 8-bit "view" of
that same register. However, this applies *only* to the low-byte registers
(AL, CL, DL, and so on). With the high-byte registers (AH, CH, DH, and so on),
there is still a performance loss, because an extra µop has to be inserted
between the write to the 8-bit register and the read of the full 32-bit
register, adding one clock cycle of latency. So unless the xH partial registers
are treated differently from the xL partial registers, applying the
optimizations described here would still be a performance win, especially since
there is no drawback.

[Bug target/81274] New: x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2017-07-01 Thread cody at codygray dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

Bug ID: 81274
   Summary: x86 optimizer emits unnecessary LEA instruction when
using AVX intrinsics
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: cody at codygray dot com
  Target Milestone: ---
Target: i?86-*-*

When AVX intrinsics are used in a function, the x86-32 optimizer emits
unnecessary LEA instructions that clobber a register, forcing it to be
preserved at additional expense.


Test Code:
----------

#include <immintrin.h>

__m256 foo(const float *x)
{
    __m256 ymmX = _mm256_load_ps(&x[0]);
    return _mm256_addsub_ps(ymmX, ymmX);
}


Compile with: "-m32 -mtune=generic -mavx -O2"

This is also reproduced at -O1 and -O3, and when tuning for any architecture
that supports AVX (not specific to the "generic" target).

It also does not matter whether the code is compiled in C or C++ mode.

This behavior is exhibited by *all* versions of GCC that support AVX targeting,
from at least 4.9.0 through the current 8.0.0 snapshot (20170701).

The code compiles warning-free, of course.

See it live on Godbolt: https://godbolt.org/g/NDDgsA


Actual Disassembly:
-------------------

foo:    # -O2 or -O3
        pushl      %ecx
        movl       8(%esp), %eax
        leal       8(%esp), %ecx
        vmovaps    (%eax), %ymm0
        popl       %ecx
        vaddsubps  %ymm0, %ymm0, %ymm0
        ret

The LEA instruction redundantly computes the address of the parameter on the
stack into ECX, and then promptly discards that value. Clobbering ECX also has
spill-over effects, requiring that additional code be emitted to preserve the
original value of this register (the PUSH+POP pair).

The same bug is observed at -O1, but the ordering of the instructions is
slightly different and the address in ECX is actually used to load EAX, further
lengthening the dependency chain for no benefit whatsoever.

foo:    # -O1
        pushl      %ecx
        leal       8(%esp), %ecx
        movl       (%ecx), %eax
        vmovaps    (%eax), %ymm0
        vaddsubps  %ymm0, %ymm0, %ymm0
        popl       %ecx
        ret


Expected Disassembly:
---------------------

foo:
        movl       8(%esp), %eax
        vmovaps    (%eax), %ymm0
        vaddsubps  %ymm0, %ymm0, %ymm0
        ret


Or better yet:

foo:
        vmovaps    8(%esp), %ymm0
        vaddsubps  %ymm0, %ymm0, %ymm0
        ret


The correct code shown above is already generated for x86-64 builds (-m64), so
this optimization deficiency affects only x86-32 builds (-m32).

[Bug c++/81392] New: Improve diagnostics for [[fallthrough]] attribute that is missing a semicolon

2017-07-11 Thread cody at codygray dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81392

Bug ID: 81392
   Summary: Improve diagnostics for [[fallthrough]] attribute that
is missing a semicolon
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: diagnostic
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: cody at codygray dot com
  Target Milestone: ---

The C++1z/C++17 draft standard introduces a [[fallthrough]] attribute to
explicitly document that fall-through behavior is intended in a switch-case
block.

This works in conjunction with G++'s -Wimplicit-fallthrough option, which gives
a warning about potentially unintended fall-through behaviors.

The [[fallthrough]] attribute is required to be applied to an empty statement
(see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0188r1.pdf), and
therefore requires a terminating semicolon.

However, forgetting that semicolon is a common error. With the following code:

#include <iostream>

int main() {
    switch (0) {
    case 0:
        std::cout << "a\n";
        [[fallthrough]]
    case 1:
        std::cout << "b\n";
        break;
    }
}

G++ (7.1 and the current trunk of 8.0) issues the following warning:

warning: this statement may fall through [-Wimplicit-fallthrough=]
std::cout << "a\n";
~~^~~~

This is less helpful than it could be. The current Clang trunk provides a
substantially more helpful error message in this case:

error: fallthrough attribute is only allowed on empty statements
[[fallthrough]]
  ^
note: did you forget ';'?
[[fallthrough]]
   ^
   ;

It would be nice to have something similar in G++.
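For comparison, the corrected program, which differs only in the terminating
semicolon so that the attribute annotates an empty statement as the standard
requires, is sketched below:

#include <iostream>

int main() {
    switch (0) {
    case 0:
        std::cout << "a\n";
        [[fallthrough]];  // the semicolon makes this an attributed empty statement
    case 1:
        std::cout << "b\n";
        break;
    }
}

With that semicolon in place, -Wimplicit-fallthrough no longer warns about the
intentional fall-through.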

[Bug target/81456] New: x86-64 optimizer makes wrong decision when optimizing for size

2017-07-16 Thread cody at codygray dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81456

Bug ID: 81456
   Summary: x86-64 optimizer makes wrong decision when optimizing
for size
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: cody at codygray dot com
  Target Milestone: ---
Target: x86_64

Consider the following code (doesn't matter if you compile it as C or C++):

#include <stdlib.h>

int Bounce(int a, int b)
{
    int mod = abs(a % b);
    return (a/b % 2) ? b-mod : mod;
}

When optimizing for speed (whether -O1, -O2, or -O3), this is compiled to the
following x86-64 machine code:

Bounce(int, int):
        movl    %edi, %eax

        cltd
        idivl   %esi
        movl    %edx, %ecx
        sarl    $31, %ecx
        xorl    %ecx, %edx
        subl    %ecx, %edx

        subl    %edx, %esi
        testb   $1, %al
        cmovne  %esi, %edx
        movl    %edx, %eax
        ret

That output is observed as far back as (at least) GCC 4.8, and all the way up
to the current 8.0 preview I have (8.0.0 20170716).

However, when optimizing for size (-Os), the same function produces this
output:

Bounce:
        movl    %edi, %eax

        cltd
        idivl   %esi
        movl    %edx, %ecx
        sarl    $31, %ecx
        xorl    %ecx, %edx
        subl    %ecx, %edx

        testb   $1, %al
        je      .L1
        subl    %edx, %esi
        movl    %esi, %edx
.L1:
        movl    %edx, %eax
        ret

This defies expectations because it is actually *larger* (more bytes) than the
version optimized for speed. The JE instruction in this version is 2 bytes, as
is each MOVL instruction, making that section 6 bytes total. However, in the
version optimized for speed, the CMOVNE instruction is 3 bytes, plus a 2-byte
MOVL, for 5 bytes total. (The SUBL instruction there is required either way.)

Now, one byte obviously isn't a big deal in terms of total size, except that
the CMOV version is also faster, so even if the two versions were exactly the
same size, it should be preferred over the branching version! (The optimizer
has no reason to expect the quotient of the division to be predictably odd or
even, so a branchless conditional move gives the best worst-case performance.)

I notice that this is a regression post-GCC 6.3. In other words, GCC 6.3
generates the same code for -Os and -O1/-O2/-O3. I don't have GCC 7.0
available, so GCC 7.1 is the first version I have available that reproduces the
described behavior. It continues to be there, as I said, in GCC 8.

This also is *not* observed when targeting 32-bit x86. You get conditional
moves when the target architecture supports them (P6 and later). So this
affects only x86-64, where conditional moves are *always* available.