[Bug target/84986] New: Performance regression: loop no longer vectorized (x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986

            Bug ID: 84986
           Summary: Performance regression: loop no longer vectorized
                    (x86-64)
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 43713
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43713&action=edit
input function showing performance regression

For context: I throw randomly generated code at compilers and look at
differences in how they optimize; see
https://github.com/gergo-/missed-optimizations for details if interested. The
test case below is entirely artificial; I do *not* have any real-world
application that depends on this.

The attached test.c file contains a function with a simple loop:

int N;

long fn1(void) {
  short i;
  long a;
  i = a = 0;
  while (i < N)
    a -= i++;
  return a;
}

Until recently, this loop was vectorized on x86-64, with the core loop (if I
understand the code correctly) looking something like this, as generated by
GCC trunk from 20180206 (with -O3):

  40:   66 0f 6f ce             movdqa %xmm6,%xmm1
  44:   66 0f 6f e3             movdqa %xmm3,%xmm4
  48:   66 0f 6f d3             movdqa %xmm3,%xmm2
  4c:   83 c0 01                add    $0x1,%eax
  4f:   66 0f 65 cb             pcmpgtw %xmm3,%xmm1
  53:   66 0f fd df             paddw  %xmm7,%xmm3
  57:   66 0f 69 e1             punpckhwd %xmm1,%xmm4
  5b:   66 0f 61 d1             punpcklwd %xmm1,%xmm2
  5f:   66 0f 6f cc             movdqa %xmm4,%xmm1
  63:   66 0f 6f e5             movdqa %xmm5,%xmm4
  67:   66 44 0f 6f c2          movdqa %xmm2,%xmm8
  6c:   66 0f 66 e2             pcmpgtd %xmm2,%xmm4
  70:   66 44 0f 62 c4          punpckldq %xmm4,%xmm8
  75:   66 0f 6a d4             punpckhdq %xmm4,%xmm2
  79:   66 0f 6f e1             movdqa %xmm1,%xmm4
  7d:   66 41 0f fb c0          psubq  %xmm8,%xmm0
  82:   66 0f fb c2             psubq  %xmm2,%xmm0
  86:   66 0f 6f d5             movdqa %xmm5,%xmm2
  8a:   66 0f 66 d1             pcmpgtd %xmm1,%xmm2
  8e:   66 0f 62 e2             punpckldq %xmm2,%xmm4
  92:   66 0f 6a ca             punpckhdq %xmm2,%xmm1
  96:   66 0f fb c4             psubq  %xmm4,%xmm0
  9a:   66 0f fb c1             psubq  %xmm1,%xmm0
  9e:   39 c1                   cmp    %eax,%ecx
  a0:   77 9e                   ja     40

(I'm sorry this comes from objdump; I didn't keep that GCC version around to
generate a nicer assembly listing.)

With a version from 20180319 (r258665), this is no longer the case:

.L3:
        movswq  %dx, %rcx
        addl    $1, %edx
        subq    %rcx, %rax
        movswl  %dx, %ecx
        cmpl    %esi, %ecx
        jl      .L3

Linking the two versions against a driver program, which simply calls this
function many times after setting N to SHRT_MAX, shows a slowdown of about
1.8x:

$ time ./test.20180206 ; time ./test.20180319
32767 elements in 0.09 sec on average, result = -53682176100

real    0m8.875s
user    0m8.844s
sys     0m0.028s
32767 elements in 0.16 sec on average, result = -53682176100

real    0m15.691s
user    0m15.688s
sys     0m0.000s

Target: x86_64-pc-linux-gnu
Configured with: ../../src/gcc/configure
--prefix=/home/gergo/optcheck/compilers/install --enable-languages=c
--with-newlib --without-headers --disable-bootstrap --disable-nls
--disable-shared --disable-multilib --disable-decimal-float --disable-threads
--disable-libatomic --disable-libgomp --disable-libmpx --disable-libquadmath
--disable-libssp --disable-libvtv --disable-libstdcxx
--program-prefix=optcheck-x86- --target=x86_64-pc-linux-gnu
Thread model: single

This is under Linux on a machine whose CPU identifies itself as Intel(R)
Core(TM) i7-4712HQ CPU @ 2.30GHz.
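(Aside: for 0 <= N <= SHRT_MAX the loop has a simple closed form, which makes
it easy to sanity-check what either binary prints. The helper below is an
illustration I'm adding here, not part of test.c; the guess that the driver
accumulates 100 calls is mine.)

/* The loop subtracts every i in [0, N) exactly once, so a = -N*(N-1)/2.
   For N = 32767 a single call returns -536821761; one hundred accumulated
   calls would give the -53682176100 printed above. */
long fn1_closed_form(long n) {
    return -(n * (n - 1)) / 2;
}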
For whatever it's worth, Clang goes the opposite way, vectorizes very
aggressively, and ends up slower:

$ time ./test.clang
32767 elements in 0.19 sec on average, result = -53682176100

real    0m18.930s
user    0m18.928s
sys     0m0.000s

With the previous version, GCC was about 2.1x faster than Clang; this seems to
have regressed to "only" 1.2x faster.
[Bug target/84986] Performance regression: loop no longer vectorized (x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986

--- Comment #1 from Gergö Barany ---
Created attachment 43714
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43714&action=edit
test driver
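(The attachment itself is not inlined here. For reference, a driver along the
following lines reproduces the numbers quoted in the report; this is a sketch
based on the description above, and the RUNS count of 100 and the use of
clock() are assumptions, not necessarily what the attached driver does.)

#include <limits.h>
#include <stdio.h>
#include <time.h>

extern int N;          /* defined in test.c */
extern long fn1(void); /* the loop under test */

int main(void) {
    enum { RUNS = 100 };   /* assumed iteration count */
    long result = 0;
    N = SHRT_MAX;
    clock_t start = clock();
    for (int i = 0; i < RUNS; i++)
        result += fn1();
    double avg = (double)(clock() - start) / CLOCKS_PER_SEC / RUNS;
    printf("%d elements in %.2f sec on average, result = %ld\n",
           N, avg, result);
    return 0;
}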
[Bug tree-optimization/81346] Missed constant propagation into comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346

--- Comment #17 from Gergö Barany ---
Thanks for fixing this. I did notice a small thing that might be considered a
tiny regression due to the fix. If the divisor is a small power of 2, as in
the following example:

int fn1(char p1) {
  long a;
  char b;
  int c = a = 4;
  b = !(p1 / a);
  if (b)
    c = 0;
  return c;
}

the division used to be replaced by a shift that updated the condition code
register (again, on ARM; r250337):

        lsrs    r3, r0, #2
        movne   r0, #4
        moveq   r0, #0
        bx      lr

whereas after the fix (tested on r250342) the new folding rule takes
precedence and generates one instruction more:

        add     r0, r0, #3
        cmp     r0, #6
        movhi   r0, #4
        movls   r0, #0
        bx      lr

I guess the rule could be updated to only apply if the divisor is not a small
power of 2, or folding a division by a power of 2 into a shift could be
prioritized.

Sorry about only pointing this out two months later! Also, let me stress that
I do not have code that depends on this transformation. This came out of
research I'm doing on missed optimization, and this was one example I found
interesting.
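(For reference, the two sequences implement two equivalent ways of testing
!(p1 / 4). A quick harness demonstrating the equivalence; plain char is
unsigned by default on ARM, so the harness spells that out explicitly. This is
my illustration, not part of the report.)

#include <assert.h>

/* Shift form (old code): p1 / 4 == 0 iff p1 >> 2 is zero, which is what
   lsrs computes while setting the condition flags.                      */
static int shift_form(unsigned char p1) { return (p1 >> 2) == 0; }

/* Range form (new code): p1 / 4 == 0 iff p1 + 3 <= 6 as an unsigned
   comparison, matching the add/cmp/movhi/movls sequence.                */
static int range_form(unsigned char p1) { return (unsigned)(p1 + 3) <= 6u; }

int main(void) {
    for (int v = 0; v <= 255; v++) {
        unsigned char c = (unsigned char)v;
        assert(shift_form(c) == !(c / 4));
        assert(range_form(c) == !(c / 4));
    }
    return 0;
}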
[Bug target/80861] New: ARM (VFPv3): Inefficient float-to-char conversion goes through memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80861

            Bug ID: 80861
           Summary: ARM (VFPv3): Inefficient float-to-char conversion goes
                    through memory
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 41407
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41407&action=edit
Input C file for triggering the bug

Consider the attached code:

$ cat tst.c
char fn1(float p1) {
  return (char) p1;
}

GCC from trunk from two weeks ago generates this code on ARM:

$ gcc tst.c -O3 -S -o -
        .arch armv7-a
        .eabi_attribute 28, 1
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 2
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file   "tst.c"
        .text
        .align  2
        .global fn1
        .syntax unified
        .arm
        .fpu vfpv3-d16
        .type   fn1, %function
fn1:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vcvt.u32.f32    s15, s0
        sub     sp, sp, #8
        vstr.32 s15, [sp, #4]   @ int
        ldrb    r0, [sp, #4]    @ zero_extendqisi2
        add     sp, sp, #8
        @ sp needed
        bx      lr
        .size   fn1, .-fn1
        .ident  "GCC: (GNU) 8.0.0 20170510 (experimental)"

Going through memory for the int-to-char truncation after the float-to-int
conversion (vcvt) is excessive. For comparison, this is the entire code
generated by Clang:

@ BB#0:
        vcvt.u32.f32    s0, s0
        vmov    r0, s0
        bx      lr

And this is what CompCert produces for the core of the function (stack
manipulation code omitted):

        vcvt.u32.f32    s12, s0
        vmov    r0, s12
        and     r0, r0, #255

My GCC version:

Target: armv7a-eabihf
Configured with: --target=armv7a-eabihf --with-arch=armv7-a
--with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard
Thread model: single
gcc version 8.0.0 20170510 (experimental) (GCC)
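(To spell out why the store/reload is unnecessary: the conversion splits into
a float-to-integer step and a low-byte truncation, and both can stay in
registers, as the Clang and CompCert outputs show. A C restatement of that
decomposition; this is my illustration, it assumes plain char is unsigned as
on ARM, and it assumes p1's value is in range, since out-of-range
float-to-integer conversion is undefined anyway.)

char fn1_steps(float p1) {
    unsigned u = (unsigned)p1; /* vcvt.u32.f32: convert in an FP register */
    return (char)u;            /* keep the low byte: a vmov (plus at most
                                  an 'and #255') suffices, no stack needed */
}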
[Bug target/80905] New: ARM: Useless initialization of struct passed by value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80905

            Bug ID: 80905
           Summary: ARM: Useless initialization of struct passed by value
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 41432
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41432&action=edit
Input C file for triggering the issue

Input program:

$ cat tst.c
struct S0 {
  int f0;
  int f1;
  int f2;
  int f3;
};

int f1(struct S0 p) {
  return p.f0;
}

int f2(struct S0 p) {
  return p.f0 + p.f3;
}

When entering the function, GCC copies the entire struct from registers to the
stack, even fields that are never used. Fields that *are* used are then
reloaded from the stack even if they are still available in the very same
registers:

$ gcc tst.c -Wall -W -O3 -S -o -
        .arch armv7-a
        .eabi_attribute 28, 1
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 2
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file   "tst.c"
        .text
        .align  2
        .global f1
        .syntax unified
        .arm
        .fpu vfpv3-d16
        .type   f1, %function
f1:
        @ args = 0, pretend = 0, frame = 16
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #16
        add     ip, sp, #16
        stmdb   ip, {r0, r1, r2, r3}
        ldr     r0, [sp]
        add     sp, sp, #16
        @ sp needed
        bx      lr
        .size   f1, .-f1
        .align  2
        .global f2
        .syntax unified
        .arm
        .fpu vfpv3-d16
        .type   f2, %function
f2:
        @ args = 0, pretend = 0, frame = 16
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #16
        add     ip, sp, #16
        stmdb   ip, {r0, r1, r2, r3}
        ldr     r0, [sp]
        ldr     r3, [sp, #12]
        add     r0, r0, r3
        add     sp, sp, #16
        @ sp needed
        bx      lr
        .size   f2, .-f2
        .ident  "GCC: (GNU) 8.0.0 20170527 (experimental)"

Target: armv7a-eabihf
Configured with: --target=armv7a-eabihf --with-arch=armv7-a
--with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard
gcc version 8.0.0 20170527 (experimental) (GCC)

This seems to be specific to ARM, as I cannot reproduce this behavior on
x86-64 or PowerPC. For comparison, LLVM generates the following code for ARM:

f1:
        .fnstart
@ BB#0:
        bx      lr

f2:
        .fnstart
@ BB#0:
        add     r0, r0, r3
        bx      lr
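(For context: under the ARM AAPCS, a 16-byte struct of four ints is passed
entirely in r0-r3, so on entry to f1 the result is already in r0, and on entry
to f2 the operands are in r0 and r3. A hypothetical caller, mine rather than
the report's, to make that concrete:)

struct S0 { int f0; int f1; int f2; int f3; };
int f1(struct S0 p);

int caller(void) {
    /* The four fields travel to f1 in r0..r3; ideally f1 returns
       immediately with r0 untouched, as in the LLVM output above. */
    struct S0 s = {1, 2, 3, 4};
    return f1(s);
}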
[Bug target/81012] New: ARM: Spill instead of register copy / dead store on int-to-double conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012

            Bug ID: 81012
           Summary: ARM: Spill instead of register copy / dead store on
                    int-to-double conversion
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 41496
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41496&action=edit
Input C file for triggering the issue

Input file (also in attachment):

double fn2(int p1, int p2) {
  double a = p1;
  if (744073425321881 * p2 + 5)
    a = 2;
  return a;
}

Generated code on ARMv7 for VFPv3:

$ gcc tst.c -Wall -Wextra -O3 -fomit-frame-pointer -S -o -
[...]
fn2:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        movw    r3, #42171
        movt    r3, 2
        push    {r4, r5}
        movw    r2, #65433
        sub     sp, sp, #8
        asr     r5, r1, #31
        movt    r2, 6195
        mvn     r4, #4
        mul     r3, r3, r1
        str     r0, [sp, #4]    // SPILL
        mla     r0, r2, r5, r3
        mvn     r5, #0
        umull   r2, r3, r1, r2
        add     r3, r0, r3
        cmp     r3, r5
        cmpeq   r2, r4
        vldreq.32       s15, [sp, #4]   @ int
        vmovne.f64      d0, #2.0e+0
        vcvteq.f64.s32  d0, s15
        add     sp, sp, #8
        @ sp needed
        pop     {r4, r5}
        bx      lr
        .size   fn2, .-fn2
        .ident  "GCC: (GNU) 8.0.0 20170606 (experimental)"

Note the store I marked "SPILL". It is a store of the integer register r0,
which is reloaded on the line marked "@ int" into a floating-point register
for the subsequent int-to-double conversion. The spill frees r0 for other use,
but it would be better to just replace the spill/reload sequence with

        vmov    s15, r0

since the register is available.

Also, if the large constant 744073425321881 in the if condition is changed to
something smaller like 1881 (that fits into a mov's immediate field), GCC
generates this code:

fn2:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        movw    r3, #1881
        sub     sp, sp, #8
        mul     r1, r3, r1
        str     r0, [sp, #4]    // DEAD STORE
        cmn     r1, #5
        vmovne.f64      d0, #2.0e+0
        vmoveq  s15, r0 @ int
        vcvteq.f64.s32  d0, s15
        add     sp, sp, #8
        @ sp needed
        bx      lr

This does perform a conditional move from r0 to s15, but it also generates a
dead store to the stack. Clang and CompCert both just do a copy and don't
touch the stack for this value.

$ gcc -v
[...]
Target: armv7a-eabihf
Configured with: --target=armv7a-eabihf --with-arch=armv7-a
--with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard
Thread model: single
gcc version 8.0.0 20170510 (experimental) (GCC)

Not sure if this is related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80861, which also goes through
the stack for a float-to-char conversion. But that's the other direction, and
if I understand correctly, there the problem is related to the final sign
extension.
[Bug target/81012] ARM: Spill instead of register copy / dead store on int-to-double conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012

--- Comment #2 from Gergö Barany ---
Created attachment 41672
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41672&action=edit
Smaller test case

Added a smaller test case:

int fn3(int p1, int p2) {
  int a = p2;
  if (p1)
    a *= 10.0;
  return a;
}

It compiles to the following:

fn3:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #8
        cmp     r0, #0
        str     r1, [sp, #4]
        beq     .L2
        vmov.f64        d6, #1.0e+1
        vmov    s15, r1 @ int
        vcvt.f64.s32    d7, s15
        vmul.f64        d7, d7, d6
        vcvt.s32.f64    s15, d7
        vstr.32 s15, [sp, #4]   @ int
.L2:
        ldr     r0, [sp, #4]
        add     sp, sp, #8
        @ sp needed
        bx      lr
        .size   fn3, .-fn3
        .ident  "GCC: (GNU) 8.0.0 20170626 (experimental)"

Instead of the first store, r1 should be moved to r0. The second store should
then be a vmov r0, s15. No spills needed. This is done correctly on x86-64:

fn3:
.LFB0:
        .cfi_startproc
        testl   %edi, %edi
        movl    %esi, %eax
        je      .L2
        pxor    %xmm0, %xmm0
        cvtsi2sd        %esi, %xmm0
        mulsd   .LC0(%rip), %xmm0
        cvttsd2si       %xmm0, %eax
.L2:
        rep ret
[Bug tree-optimization/81346] New: Missed constant propagation into comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346

            Bug ID: 81346
           Summary: Missed constant propagation into comparison
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 41694
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41694&action=edit
Input C file for triggering the issue

The attached C file contains the following function:

int fn1(int p1) {
  int b = (p1 / 12 == 6);
  return b;
}

As these are integers, the expression (p1 / 12 == 6) can be optimized to a
subtraction and an unsigned compare. GCC can do this (here for ARM):

fn1:
        sub     r0, r0, #72
        cmp     r0, #11
        movhi   r0, #0
        movls   r0, #1
        bx      lr

The attached file also contains the following function:

int fn2(int p1) {
  int a = 6;
  int b = (p1 / 12 == a);
  return b;
}

This is equivalent to the above code; the value of a can only ever be 6.
Consequently, the output machine code should be equivalent. However, GCC does
not recognize the above pattern and generates more complex code:

fn2:
        movw    r3, #43691
        movt    r3, 10922
        smull   r2, r3, r3, r0
        asr     r0, r0, #31
        rsb     r0, r0, r3, asr #1
        sub     r0, r0, #6
        clz     r0, r0
        lsr     r0, r0, #5
        bx      lr

I believe this is a target-independent optimization issue because x86-64 and
PowerPC behave analogously, for example (x86-64):

fn1:
        subl    $72, %edi
        xorl    %eax, %eax
        cmpl    $11, %edi
        setbe   %al
        ret

fn2:
        movl    %edi, %eax
        movl    $715827883, %edx
        sarl    $31, %edi
        imull   %edx
        xorl    %eax, %eax
        sarl    %edx
        subl    %edi, %edx
        cmpl    $6, %edx
        sete    %al
        ret

Version: gcc version 8.0.0 20170706 (experimental) (GCC)
Configured with: --target=armv7a-eabihf --with-arch=armv7-a
--with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard
or with: --target=x86_64-pc-linux-gnu
or with: --target=ppc-eabi
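(For completeness, the subtraction-and-unsigned-compare form follows from
interval reasoning: with C's truncating division, p1 / 12 == 6 holds exactly
for p1 in [72, 83], and membership in that interval is one wrapping
subtraction plus one unsigned compare, which is just what fn1's code does
above. A small harness checking the equivalence; the harness is my addition,
not part of the attached file.)

#include <assert.h>

/* p1 / 12 == 6 iff 72 <= p1 <= 83 iff (unsigned)p1 - 72u <= 11u: for
   p1 < 72, including all negative p1, the subtraction wraps to a huge
   unsigned value, which fails the comparison, as intended. */
static int range_form(int p1) {
    return (unsigned)p1 - 72u <= 11u;
}

int main(void) {
    for (int v = -1000; v <= 1000; v++) /* spot-check around the interval */
        assert((v / 12 == 6) == range_form(v));
    return 0;
}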
[Bug tree-optimization/81346] Missed constant propagation into comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346

--- Comment #1 from Gergö Barany ---
Sorry, I forgot to add the command line. I use gcc -O3 on all platforms.