[Bug target/84986] New: Performance regression: loop no longer vectorized (x86-64)

2018-03-20 Thread gergo.barany at inria dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986

Bug ID: 84986
   Summary: Performance regression: loop no longer vectorized
(x86-64)
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 43713
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43713&action=edit
input function showing performance regression

For context: I throw randomly generated code at compilers and look at
differences in how they optimize; see
https://github.com/gergo-/missed-optimizations for details if interested. The
test case below is entirely artificial, I do *not* have any real-world
application that depends on this.

The attached test.c file contains a function with a simple loop:

int N;
long fn1(void) {
  short i;
  long a;
  i = a = 0;
  while (i < N)
    a -= i++;
  return a;
}

Until recently, this loop used to be vectorized on x86-64, with the core loop
(if I understand the code correctly) looking something like this, as generated
by GCC trunk from 20180206 (with -O3):

  40:   66 0f 6f ce movdqa %xmm6,%xmm1
  44:   66 0f 6f e3 movdqa %xmm3,%xmm4
  48:   66 0f 6f d3 movdqa %xmm3,%xmm2
  4c:   83 c0 01    add    $0x1,%eax
  4f:   66 0f 65 cb pcmpgtw %xmm3,%xmm1
  53:   66 0f fd df paddw  %xmm7,%xmm3
  57:   66 0f 69 e1 punpckhwd %xmm1,%xmm4
  5b:   66 0f 61 d1 punpcklwd %xmm1,%xmm2
  5f:   66 0f 6f cc movdqa %xmm4,%xmm1
  63:   66 0f 6f e5 movdqa %xmm5,%xmm4
  67:   66 44 0f 6f c2  movdqa %xmm2,%xmm8
  6c:   66 0f 66 e2 pcmpgtd %xmm2,%xmm4
  70:   66 44 0f 62 c4  punpckldq %xmm4,%xmm8
  75:   66 0f 6a d4 punpckhdq %xmm4,%xmm2
  79:   66 0f 6f e1 movdqa %xmm1,%xmm4
  7d:   66 41 0f fb c0  psubq  %xmm8,%xmm0
  82:   66 0f fb c2 psubq  %xmm2,%xmm0
  86:   66 0f 6f d5 movdqa %xmm5,%xmm2
  8a:   66 0f 66 d1 pcmpgtd %xmm1,%xmm2
  8e:   66 0f 62 e2 punpckldq %xmm2,%xmm4
  92:   66 0f 6a ca punpckhdq %xmm2,%xmm1
  96:   66 0f fb c4 psubq  %xmm4,%xmm0
  9a:   66 0f fb c1 psubq  %xmm1,%xmm0
  9e:   39 c1   cmp    %eax,%ecx
  a0:   77 9e   ja     40

(I'm sorry this comes from objdump; I didn't keep that GCC version around to
generate a nicer assembly listing.)

With a version from 20180319 (r258665), this is no longer the case:

.L3:
movswq  %dx, %rcx
addl    $1, %edx
subq    %rcx, %rax
movswl  %dx, %ecx
cmpl    %esi, %ecx
jl  .L3

Linking the two versions against a driver program, which simply calls this
function many times after setting N to SHRT_MAX, shows a slowdown of about
1.8x:

$ time ./test.20180206 ; time ./test.20180319 
32767 elements in 0.09 sec on average, result = -53682176100

real    0m8.875s
user    0m8.844s
sys     0m0.028s
32767 elements in 0.16 sec on average, result = -53682176100

real    0m15.691s
user    0m15.688s
sys     0m0.000s
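
A driver along the following lines is enough to reproduce the measurement. This
is a minimal sketch, not the attached file: the helper name run_calls, the
repetition count, and the use of clock() for timing are all assumptions made
for illustration. Only fn1 and N are taken from the report.

```c
#include <limits.h>
#include <time.h>

int N;

/* The function from the report, unchanged. */
long fn1(void) {
  short i;
  long a;
  i = a = 0;
  while (i < N)
    a -= i++;
  return a;
}

/* Hypothetical driver core: set N to SHRT_MAX, call fn1 `calls` times,
   and sum the results so that the calls have an observable effect.
   If `seconds` is non-null, it receives the elapsed CPU time. */
long long run_calls(int calls, double *seconds) {
  long long total = 0;
  clock_t start = clock();
  N = SHRT_MAX;
  for (int k = 0; k < calls; k++)
    total += fn1();
  if (seconds)
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
  return total;
}
```

For a fair comparison, fn1 should live in its own object file built by the
compiler under test, as in the setup described above; with everything in one
translation unit the compiler is free to inline or hoist the call. A single
call with N = SHRT_MAX returns -536821761, so 100 summed calls give the
-53682176100 shown in the timing output.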

Target: x86_64-pc-linux-gnu
Configured with: ../../src/gcc/configure
--prefix=/home/gergo/optcheck/compilers/install --enable-languages=c
--with-newlib --without-headers --disable-bootstrap --disable-nls
--disable-shared --disable-multilib --disable-decimal-float --disable-threads
--disable-libatomic --disable-libgomp --disable-libmpx --disable-libquadmath
--disable-libssp --disable-libvtv --disable-libstdcxx
--program-prefix=optcheck-x86- --target=x86_64-pc-linux-gnu
Thread model: single

This is under Linux on a machine whose CPU identifies itself as Intel(R)
Core(TM) i7-4712HQ CPU @ 2.30GHz.


For whatever it's worth, Clang goes the opposite way, vectorizes very
aggressively, and ends up slower:

$ time ./test.clang 
32767 elements in 0.19 sec on average, result = -53682176100

real    0m18.930s
user    0m18.928s
sys     0m0.000s

With the previous version, GCC was about 2.1x faster than Clang; this seems to
have regressed to "only" 1.2x faster.

[Bug target/84986] Performance regression: loop no longer vectorized (x86-64)

2018-03-20 Thread gergo.barany at inria dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986

--- Comment #1 from Gergö Barany  ---
Created attachment 43714
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43714&action=edit
test driver

[Bug tree-optimization/81346] Missed constant propagation into comparison

2017-09-14 Thread gergo.barany at inria dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346

--- Comment #17 from Gergö Barany  ---
Thanks for fixing this. I did notice a small thing that might be considered a
tiny regression due to the fix.

If the divisor is a small power of 2, as in the following example:

int fn1(char p1) {
  long a;
  char b;
  int c = a = 4;
  b = !(p1 / a);
  if (b)
    c = 0;
  return c;
}

the division used to be replaced by a shift that updated the condition code
register (again, on ARM; r250337):

lsrs    r3, r0, #2
movne   r0, #4
moveq   r0, #0
bx  lr

whereas after the fix (tested on r250342) the new folding rule takes precedence
and generates one instruction more:

add r0, r0, #3
cmp r0, #6
movhi   r0, #4
movls   r0, #0
bx  lr

I guess the rule could be updated to only apply if the divisor is not a small
power of 2, or folding a division by a power of 2 into a shift could be
prioritized.
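
The two code sequences correspond to two ways of testing whether a quotient by
4 is zero. The sketch below spells both out in C and checks them against the
plain division. It assumes that plain char is unsigned on the ARM target, which
is what makes the shift form valid; the function names are hypothetical and
chosen only for this illustration.

```c
/* Reference: the test from the example, !(x / 4), on a signed value. */
int quotient_is_zero(int x) { return !(x / 4); }

/* Shift form (the lsrs sequence): valid only for non-negative x,
   where x / 4 == x >> 2. */
int quotient_is_zero_shift(unsigned char x) { return (x >> 2) == 0; }

/* Range form (the add/cmp sequence): x / 4 == 0 exactly when
   -3 <= x <= 3. Adding 3 in unsigned arithmetic maps that range to
   0..6; every other value ends up above 6. */
int quotient_is_zero_range(int x) { return (unsigned)x + 3u <= 6u; }

/* Hypothetical helper: count disagreements with the reference. */
int count_pow2_mismatches(void) {
  int bad = 0;
  for (int x = 0; x <= 255; x++)       /* all unsigned char inputs */
    if (quotient_is_zero_shift((unsigned char)x) != quotient_is_zero(x))
      bad++;
  for (int x = -1000; x <= 1000; x++)  /* a window of signed inputs */
    if (quotient_is_zero_range(x) != quotient_is_zero(x))
      bad++;
  return bad;
}
```

Both forms agree with the division on their respective domains, so the choice
between them is purely a matter of instruction count: the shift sets the flags
directly, while the range form needs an add followed by a compare.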

Sorry about only pointing this out two months later! Also, let me stress that I
do not have code that depends on this transformation. This came out of research
I'm doing on missed optimization, and this was one example I found interesting.

[Bug target/80861] New: ARM (VFPv3): Inefficient float-to-char conversion goes through memory

2017-05-22 Thread gergo.barany at inria dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80861

Bug ID: 80861
   Summary: ARM (VFPv3): Inefficient float-to-char conversion goes
through memory
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 41407
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41407&action=edit
Input C file for triggering the bug

Consider the attached code:

$ cat tst.c
char fn1(float p1) {
  return (char) p1;
}

GCC from trunk from two weeks ago generates this code on ARM:

$ gcc tst.c -O3 -S -o -
.arch armv7-a
.eabi_attribute 28, 1
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file   "tst.c"
.text
.align  2
.global fn1
.syntax unified
.arm
.fpu vfpv3-d16
.type   fn1, %function
fn1:
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
vcvt.u32.f32    s15, s0
sub sp, sp, #8
vstr.32 s15, [sp, #4]   @ int
ldrb    r0, [sp, #4]    @ zero_extendqisi2
add sp, sp, #8
@ sp needed
bx  lr
.size   fn1, .-fn1
.ident  "GCC: (GNU) 8.0.0 20170510 (experimental)"


Going through memory for the int-to-char truncation after the float-to-int
conversion (vcvt) is excessive. For comparison, this is the entire code
generated by Clang:

@ BB#0:
vcvt.u32.f32    s0, s0
vmov    r0, s0
bx  lr

And this is what CompCert produces for the core of the function (stack
manipulation code omitted):

vcvt.u32.f32 s12, s0
vmov    r0, s12
and r0, r0, #255


My GCC version:

Target: armv7a-eabihf
Configured with: --target=armv7a-eabihf --with-arch=armv7-a
--with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard
Thread model: single
gcc version 8.0.0 20170510 (experimental) (GCC)

[Bug target/80905] New: ARM: Useless initialization of struct passed by value

2017-05-28 Thread gergo.barany at inria dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80905

Bug ID: 80905
   Summary: ARM: Useless initialization of struct passed by value
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 41432
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41432&action=edit
Input C file for triggering the issue

Input program:

$ cat tst.c
struct S0 {
  int f0;
  int f1;
  int f2;
  int f3;
};

int f1(struct S0 p) {
    return p.f0;
}

int f2(struct S0 p) {
    return p.f0 + p.f3;
}



When entering the function, GCC copies the entire struct from registers to the
stack, even fields that are never used. Fields that *are* used are then
reloaded from the stack even if they are still available in the very same
registers:

$ gcc tst.c -Wall -W -O3 -S -o -
.arch armv7-a
.eabi_attribute 28, 1
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file   "tst.c"
.text
.align  2
.global f1
.syntax unified
.arm
.fpu vfpv3-d16
.type   f1, %function
f1:
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #16
add ip, sp, #16
stmdb   ip, {r0, r1, r2, r3}
ldr r0, [sp]
add sp, sp, #16
@ sp needed
bx  lr
.size   f1, .-f1
.align  2
.global f2
.syntax unified
.arm
.fpu vfpv3-d16
.type   f2, %function
f2:
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #16
add ip, sp, #16
stmdb   ip, {r0, r1, r2, r3}
ldr r0, [sp]
ldr r3, [sp, #12]
add r0, r0, r3
add sp, sp, #16
@ sp needed
bx  lr
.size   f2, .-f2
.ident  "GCC: (GNU) 8.0.0 20170527 (experimental)"

Target: armv7a-eabihf
Configured with: --target=armv7a-eabihf --with-arch=armv7-a
--with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard
gcc version 8.0.0 20170527 (experimental) (GCC)


This seems to be specific to ARM as I cannot reproduce this behavior on x86-64
or PowerPC.


For comparison, LLVM generates the following code for ARM:

f1:
.fnstart
@ BB#0:
bx  lr

f2:
.fnstart
@ BB#0:
add r0, r0, r3
bx  lr

[Bug target/81012] New: ARM: Spill instead of register copy / dead store on int-to-double conversion

2017-06-07 Thread gergo.barany at inria dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012

Bug ID: 81012
   Summary: ARM: Spill instead of register copy / dead store on
int-to-double conversion
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 41496
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41496&action=edit
Input C file for triggering the issue

Input file (also in attachment):

double fn2(int p1, int p2) {
  double a = p1;
  if (744073425321881 * p2 + 5)
    a = 2;
  return a;
}

Generated code on ARMv7 for VFPv3:

$ gcc tst.c -Wall -Wextra -O3 -fomit-frame-pointer -S -o -
[...]
fn2:
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
movw    r3, #42171
movt    r3, 2
push    {r4, r5}
movw    r2, #65433
sub sp, sp, #8
asr r5, r1, #31
movt    r2, 6195
mvn r4, #4
mul r3, r3, r1
str r0, [sp, #4]   // SPILL
mla r0, r2, r5, r3
mvn r5, #0
umull   r2, r3, r1, r2
add r3, r0, r3
cmp r3, r5
cmpeq   r2, r4
vldreq.32   s15, [sp, #4]   @ int
vmovne.f64  d0, #2.0e+0
vcvteq.f64.s32  d0, s15
add sp, sp, #8
@ sp needed
pop {r4, r5}
bx  lr
.size   fn2, .-fn2
.ident  "GCC: (GNU) 8.0.0 20170606 (experimental)"

Note the store I marked "SPILL". It is a store of the integer register r0 which
is reloaded on the line marked "@ int" into a floating-point register for
subsequent int-to-double conversion. The spill frees r0 for other use, but it
would be better to just replace the spill/reload sequence with

vmov s15, r0

since the register is available.

Also, if the large constant 744073425321881 in the if condition is changed to
something smaller like 1881 (that fits into a mov's immediate field), GCC
generates this code:

fn2:
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
movw    r3, #1881
sub sp, sp, #8
mul r1, r3, r1
str r0, [sp, #4]   // DEAD STORE
cmn r1, #5
vmovne.f64  d0, #2.0e+0
vmoveq  s15, r0 @ int
vcvteq.f64.s32  d0, s15
add sp, sp, #8
@ sp needed
bx  lr

This does perform a conditional move from r0 to s15, but it also generates a
dead store to the stack.

Clang and CompCert both just do a copy and don't touch the stack for this
value.

$ gcc -v
[...]
Target: armv7a-eabihf
Configured with: --target=armv7a-eabihf --with-arch=armv7-a
--with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard
Thread model: single
gcc version 8.0.0 20170510 (experimental) (GCC)

Not sure if this is related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80861, which also goes through the
stack for a float-to-char conversion. But that's the other direction, and if I
understand correctly, there the problem is related to the final sign extension.

[Bug target/81012] ARM: Spill instead of register copy / dead store on int-to-double conversion

2017-07-03 Thread gergo.barany at inria dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012

--- Comment #2 from Gergö Barany  ---
Created attachment 41672
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41672&action=edit
Smaller test case

Added a smaller test case:

int fn3(int p1, int p2) {
  int a = p2;
  if (p1)
    a *= 10.0;
  return a;
}

It compiles to the following:

fn3:
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #8
cmp r0, #0
str r1, [sp, #4]
beq .L2
vmov.f64    d6, #1.0e+1
vmov    s15, r1 @ int
vcvt.f64.s32    d7, s15
vmul.f64    d7, d7, d6
vcvt.s32.f64    s15, d7
vstr.32 s15, [sp, #4]   @ int
.L2:
ldr r0, [sp, #4]
add sp, sp, #8
@ sp needed
bx  lr
.size   fn3, .-fn3
.ident  "GCC: (GNU) 8.0.0 20170626 (experimental)"

Instead of the first store, r1 should be moved to r0. The second store should
then be a vmov r0, s15. No spills needed.

This is done correctly on x86-64:

fn3:
.LFB0:
.cfi_startproc
testl   %edi, %edi
movl    %esi, %eax
je  .L2
pxor    %xmm0, %xmm0
cvtsi2sd    %esi, %xmm0
mulsd   .LC0(%rip), %xmm0
cvttsd2si   %xmm0, %eax
.L2:
rep ret

[Bug tree-optimization/81346] New: Missed constant propagation into comparison

2017-07-06 Thread gergo.barany at inria dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346

Bug ID: 81346
   Summary: Missed constant propagation into comparison
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 41694
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41694&action=edit
Input C file for triggering the issue

The attached C file contains the following function:

int fn1(int p1) {
  int b = (p1 / 12 == 6);
  return b;
}

As these are integers, the expression (p1 / 12 == 6) can be optimized to a
subtraction and an unsigned compare. GCC can do this (here for ARM):

fn1:
sub r0, r0, #72
cmp r0, #11
movhi   r0, #0
movls   r0, #1
bx  lr
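
The reasoning behind that transformation can be checked in C. With truncating
integer division, p1 / 12 == 6 holds exactly for 72 <= p1 <= 83, and
subtracting 72 in unsigned arithmetic turns that range test into a single
compare against 11. The following is a minimal sketch of the equivalence; the
function names are hypothetical, chosen only for this illustration.

```c
/* Direct form from the report: signed division, then compare. */
int div_eq(int p1) { return p1 / 12 == 6; }

/* Folded form: subtracting 72 as unsigned maps 72..83 to 0..11 and
   wraps every other value to something larger than 11. */
int range_eq(int p1) { return (unsigned)p1 - 72u <= 11u; }

/* Hypothetical helper: count inputs in [lo, hi] where the forms differ. */
long long count_mismatches(int lo, int hi) {
  long long bad = 0;
  for (long long p = lo; p <= hi; p++)
    if (div_eq((int)p) != range_eq((int)p))
      bad++;
  return bad;
}
```

The subtraction is done on unsigned operands so that the wraparound is well
defined in C; this matches the sub followed by an unsigned compare (movhi /
movls) in the ARM code above.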

The attached file also contains the following function:

int fn2(int p1) {
  int a = 6;
  int b = (p1 / 12 == a);
  return b;
}

This is equivalent to the above code; the value of a can only ever be 6.
Consequently, the output machine code should be equivalent. However, GCC does
not recognize the above pattern and generates more complex code:

fn2: 
movw    r3, #43691
movt    r3, 10922
smull   r2, r3, r3, r0
asr r0, r0, #31
rsb r0, r0, r3, asr #1
sub r0, r0, #6
clz r0, r0
lsr r0, r0, #5
bx  lr

I believe this is a target-independent optimization issue because x86-64 and
PowerPC behave analogously, for example (x86-64):

fn1:
subl    $72, %edi
xorl    %eax, %eax
cmpl    $11, %edi
setbe   %al
ret

fn2:
movl    %edi, %eax
movl    $715827883, %edx
sarl    $31, %edi
imull   %edx
xorl    %eax, %eax
sarl    %edx
subl    %edi, %edx
cmpl    $6, %edx
sete    %al
ret

Version:
gcc version 8.0.0 20170706 (experimental) (GCC)
Configured with: --target=armv7a-eabihf --with-arch=armv7-a
--with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard
or
with: --target=x86_64-pc-linux-gnu
or
with: --target=ppc-eabi

[Bug tree-optimization/81346] Missed constant propagation into comparison

2017-07-06 Thread gergo.barany at inria dot fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346

--- Comment #1 from Gergö Barany  ---
Sorry, forgot to add the command line. I use gcc -O3 on all platforms.