[Bug rtl-optimization/78972] New: [5/6/7 Regression] poor x86 simd instruction scheduling

2017-01-02 Thread liquidsun at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78972

            Bug ID: 78972
           Summary: [5/6/7 Regression] poor x86 simd instruction scheduling
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: liquidsun at gmail dot com
  Target Milestone: ---

Created attachment 40441
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40441&action=edit
example.c

[Bug target/78972] [5/6/7 Regression] poor x86 simd instruction scheduling

2017-01-02 Thread liquidsun at gmail dot com

--- Comment #1 from Andrew M.  ---
GCC versions >= 5 started moving all of the additions to the bottom of the
function instead of keeping a running total. Optimization appears to follow
4.x.x up to the tree-reassoc1 pass, where >= 5 orders the additions slightly
differently. This stays the same until rtl-expand, where _all_ of the additions
get deferred to the bottom of the function, which requires a massive stack
frame and causes a large performance hit. No version of 4.x.x I tried had this
problem, so it looks like it was introduced in 5.
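
To illustrate the two shapes described above, here is a minimal sketch. It is not the attached example.c (which is not reproduced in this thread); the type `v4u32` and the functions `sum_running` and `sum_deferred` are hypothetical names, and the GCC/Clang `vector_size` extension is assumed in place of whatever intrinsics the attachment uses. The compiler is free to reorder either form, so this only shows the difference in live values at the source level.

```c
#include <stdint.h>

/* Hypothetical 4 x 32-bit vector type (GCC/Clang vector extension).
   On x86-64 this is assumed to map onto a single SSE2 register. */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Running total: each product is folded into acc as soon as it is
   computed, so only acc and one temporary are live at any time. */
v4u32 sum_running(const v4u32 *a, const v4u32 *b)
{
    v4u32 acc = a[0] * b[0];
    acc += a[1] * b[1];
    acc += a[2] * b[2];
    acc += a[3] * b[3];
    return acc;
}

/* Deferred: every product is computed first and the additions happen
   at the bottom, so all products stay live until the end. With enough
   terms they no longer fit in registers and must be spilled to the
   stack. */
v4u32 sum_deferred(const v4u32 *a, const v4u32 *b)
{
    v4u32 p0 = a[0] * b[0];
    v4u32 p1 = a[1] * b[1];
    v4u32 p2 = a[2] * b[2];
    v4u32 p3 = a[3] * b[3];
    return (p0 + p1) + (p2 + p3);
}
```

Both functions compute the same result, since unsigned addition wraps and is associative. The report is that GCC >= 5 ends up with code resembling the second shape even when the source is written like the first.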

[Bug target/78972] [5/6/7 Regression] poor x86 simd instruction scheduling

2017-01-02 Thread liquidsun at gmail dot com

--- Comment #2 from Andrew M.  ---
Created attachment 40442
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40442&action=edit
generated code for gcc-4.9.4 example.c -O1

[Bug target/78972] [5/6/7 Regression] poor x86 simd instruction scheduling

2017-01-02 Thread liquidsun at gmail dot com

--- Comment #3 from Andrew M.  ---
Created attachment 40443
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40443&action=edit
generated code for gcc-6.3 example.c -O1

[Bug rtl-optimization/78972] [5/6/7 Regression] poor x86 simd instruction scheduling

2017-01-02 Thread liquidsun at gmail dot com

Andrew M.  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |rtl-optimization

--- Comment #4 from Andrew M.  ---
SSE2, AVX, and AVX2 code is all generated similarly poorly. Non-SIMD versions
do not appear to be affected. I don't know enough (anything) about gcc to
investigate any further at this point.

[Bug target/78972] [5/6/7 Regression] poor x86 simd instruction scheduling

2017-01-04 Thread liquidsun at gmail dot com

--- Comment #8 from Andrew M.  ---
(In reply to Andrew Pinski from comment #7)
> One thing to try is -fno-tree-ter.

Stack sizes for -fno-tree-ter:

4.9.4: 272 bytes
5.1-5.4: 288 bytes
6.1-6.3: 560 bytes
7: 560 bytes

Performance improves a lot with -fno-tree-ter, with 5.x going back to 4.9
levels and 6.x landing somewhere in between 4.9 and 6.x without -fno-tree-ter.