[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-11-10 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #43 from Jiu Fu Guo --- Author: guojiufu Date: Mon Nov 11 06:30:38 2019 New Revision: 278034 URL: https://gcc.gnu.org/viewcvs?rev=278034&root=gcc&view=rev Log: rs6000: Refine small loop unroll in loop_unroll_adjust hook In this pat

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-27 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #42 from Jiu Fu Guo --- Author: guojiufu Date: Mon Oct 28 05:23:24 2019 New Revision: 277501 URL: https://gcc.gnu.org/viewcvs?rev=277501&root=gcc&view=rev Log: rs6000: Enable limited unrolling at -O2 In PR88760, there are a few diss

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-22 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #41 from Jiu Fu Guo --- for code: subroutine foo (i, i1, block) integer :: i, i1 integer :: block(9, 9, 9) block(i:9,1,i1) = block(i:9,1,i1) - 10 end subroutine foo "-funroll-loops --param max-unroll-times=2 --param

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-14 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #40 from rguenther at suse dot de --- On Sat, 12 Oct 2019, guojiufu at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #39 from Jiu Fu Guo --- > For small loop (1-2 stmts), in forms of GI

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-12 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #39 from Jiu Fu Guo --- For small loop (1-2 stmts), in forms of GIMPLE and RTL, it would be around 5-10 instructions: 2-4 insns per stmt, ~4 insns for idx. With current unroller, here is a statistic on spec2017. Using --param max-un

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #38 from rguenther at suse dot de --- On Fri, 11 Oct 2019, segher at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #37 from Segher Boessenkool --- > -- If it is done in RTL it should re

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #37 from Segher Boessenkool --- -- If it is done in RTL it should really be done earlier, it doesn't get all optimisations it should right now. -- Unrolling small loops more aggressively (at -O2 even) perhaps needs to be done at a di

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #36 from rguenther at suse dot de --- On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #32 from Wilco --- > (In reply to Segher Boessenkool from comment #31)

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #35 from rguenther at suse dot de --- On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #34 from Wilco --- > (In reply to rguent...@suse.de from comment #30) >

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #34 from Wilco --- (In reply to rguent...@suse.de from comment #30) > On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > > > --- Comment #29 from Wilco --- > > (In repl

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #33 from rguenther at suse dot de --- On Fri, 11 Oct 2019, segher at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #31 from Segher Boessenkool --- > Gimple passes know a lot about machi

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #32 from Wilco --- (In reply to Segher Boessenkool from comment #31) > Gimple passes know a lot about machine details, too. > > Irrespective of if this is "low-level" or "high-level", it should be done > earlier than it is now. It s

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #31 from Segher Boessenkool --- Gimple passes know a lot about machine details, too. Irrespective of if this is "low-level" or "high-level", it should be done earlier than it is now. It should either be done right after expand, or s

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #30 from rguenther at suse dot de --- On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #29 from Wilco --- > (In reply to Jiu Fu Guo from comment #28) > > For

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #29 from Wilco --- (In reply to Jiu Fu Guo from comment #28) > For these kind of small loops, it would be acceptable to unroll in GIMPLE, > because register pressure and instruction cost may not be major concerns; > just like "cunrol

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-10 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 Jiu Fu Guo changed:
What | Removed | Added
CC | | guojiufu at gcc dot gnu.org
--- Comment #28

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-10 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 Wilco changed:
What | Removed | Added
CC | | wdijkstr at arm dot com
--- Comment #27 from Wil

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-09-27 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #26 from Segher Boessenkool --- Yeah, and it probably should be a param (that different targets can default differently, per CPU probably). On most Power CPUs all loops take a minimum number of cycles per iteration (say, three), but

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-09-27 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #25 from Richard Earnshaw --- Well very small loops should be unrolled more than slightly larger ones. So perhaps if the loop body is only 3 or 4 instructions it should be unrolled four times but above that perhaps only twice. Some
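
Read as a rule of thumb, the visible part of this comment maps loop-body size to an unroll factor. A minimal C sketch of that mapping follows. The function name `pick_unroll_factor` and the exact cut-off of 4 instructions are assumptions made for illustration, not GCC code, and since the comment is truncated in this listing, any further conditions it states are not reflected here.

```c
#include <assert.h>

/* Hypothetical helper, not part of GCC: choose an unroll factor from the
   number of instructions in the loop body, following the rule of thumb
   "3 or 4 instructions: unroll 4x; larger bodies: unroll 2x".  */
unsigned pick_unroll_factor(unsigned body_insns)
{
    assert(body_insns > 0);   /* an empty body has nothing to unroll */

    if (body_insns <= 4)      /* assumed cut-off for "very small" loops */
        return 4;
    return 2;                 /* larger bodies get the conservative factor */
}
```

A real target would presumably also cap the factor by the known trip count and take the thresholds from tunable parameters rather than constants, as the neighbouring comments suggest.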

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-09-25 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #24 from Segher Boessenkool --- On some (many?) targets it would be good to unroll all loops with a small body (not containing calls etc.) at -O2 already, some small number of times (2 or 4 maybe).

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-25 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #23 from Wilco --- (In reply to ktkachov from comment #22) > helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%) > improvement on parest, and about 5.3% on a more aggressive CPU. > I tried unrolling 8x in a similar man

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #22 from ktkachov at gcc dot gnu.org --- Some more experiments... Unrolling 4x in a similar way to my previous example and not splitting the accumulator (separate issue): unsigned int *colnums; double *val; struct foostruct { unsig
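
The code in this comment is cut off in the listing, so the following is only a sketch of what 4x manual unrolling with an unsplit accumulator could look like for a sparse multiply-add loop. The names `colnums` and `val` come from the visible fragment; the function name `sparse_dot_unroll4`, the `src` array, and the loop shape are assumptions, not the code that was benchmarked.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of 4x manual unrolling with a single accumulator.
   `src` and the function name are hypothetical; `colnums` and `val`
   follow the names visible in the truncated comment.  */
double sparse_dot_unroll4(const double *val, const unsigned int *colnums,
                          const double *src, size_t n)
{
    assert(n == 0 || (val && colnums && src));

    double sum = 0.0;
    size_t i = 0;

    /* Main body: four multiply-adds per iteration, in the same order as
       the rolled loop, so the accumulator is not split.  */
    for (; i + 3 < n; i += 4) {
        sum += val[i]     * src[colnums[i]];
        sum += val[i + 1] * src[colnums[i + 1]];
        sum += val[i + 2] * src[colnums[i + 2]];
        sum += val[i + 3] * src[colnums[i + 3]];
    }

    /* Remainder: at most three leftover elements.  */
    for (; i < n; i++)
        sum += val[i] * src[colnums[i]];

    return sum;
}
```

Keeping one accumulator preserves the original order of floating-point additions, so the result matches the rolled loop; splitting it into several partial sums would shorten the dependency chain but change rounding, which is the separate issue the comment sets aside.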

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #21 from Wilco --- (In reply to rguent...@suse.de from comment #20) > On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > > > --- Comment #19 from Wilco --- > > (In repl

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #20 from rguenther at suse dot de --- On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #19 from Wilco --- > (In reply to rguent...@suse.de from comment #18) >

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #19 from Wilco --- (In reply to rguent...@suse.de from comment #18) > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's > > suggestion) > > If that helps, sure (I'd have guessed uarchs are going to split > load

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #18 from rguenther at suse dot de --- On Thu, 24 Jan 2019, ktkachov at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #17 from ktkachov at gcc dot gnu.org --- > I played around with the s

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #17 from ktkachov at gcc dot gnu.org --- I played around with the source to do some conservative 2x manual unrolling in the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA loops). This was to try out Richard's

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-17 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #16 from Wilco --- (In reply to rguent...@suse.de from comment #15) > which is what I referred to for branch prediction. Your & prompts me > to a way to do something similar to Duff's device, turning the loop into a nest. > > head: >i

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-17 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #15 from rguenther at suse dot de --- On Thu, 17 Jan 2019, wilco at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #14 from Wilco --- > (In reply to rguent...@suse.de from comment #13) >

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-17 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #14 from Wilco --- (In reply to rguent...@suse.de from comment #13) > Usually the peeling is done to improve branch prediction on the > prologue/epilogue. Modern branch predictors do much better on a loop than with this kind of code

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-17 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #13 from rguenther at suse dot de --- On Wed, 16 Jan 2019, ktkachov at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #11 from ktkachov at gcc dot gnu.org --- > Thank you all for the inpu

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-16 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #12 from ktkachov at gcc dot gnu.org --- (In reply to ktkachov from comment #11) > > As an experiment I hacked the AArch64 assembly of the function generated > with -funroll-loops to replace the peeled prologue version with a simple >

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-16 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #11 from ktkachov at gcc dot gnu.org --- Thank you all for the input. Just to add a bit of data. I've instrumented 510.parest_r to count the number of loop iterations to get a feel for how much of the unrolled loop is spent in the act

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-16 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 rsandifo at gcc dot gnu.org changed:
What | Removed | Added
Status | UNCONFIRMED | NEW
Last reconfirmed |

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-15 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #9 from rguenther at suse dot de --- On Tue, 15 Jan 2019, ktkachov at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #8 from ktkachov at gcc dot gnu.org --- > btw looks likes ICC vectoris

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-15 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #8 from ktkachov at gcc dot gnu.org --- btw looks like ICC vectorises this as well as unrolling: ..B1.14: movl (%rcx,%rbx,4), %r15d vmovsd (%rdi,%r15,8), %xmm2

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #7 from Wilco --- (In reply to rguent...@suse.de from comment #6) > On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > > > --- Comment #5 from Wilco --- > > (In reply to

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #6 from rguenther at suse dot de --- On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #5 from Wilco --- > (In reply to Wilco from comment #4) > > (In reply to

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #5 from Wilco --- (In reply to Wilco from comment #4) > (In reply to ktkachov from comment #2) > > Created attachment 45386 [details] > > aarch64-llvm output with -Ofast -mcpu=cortex-a57 > > > > I'm attaching the full LLVM aarch64 ou

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 Wilco changed:
What | Removed | Added
CC | | wilco at gcc dot gnu.org
--- Comment #4 from Wil

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #3 from Richard Biener --- (In reply to ktkachov from comment #2) > Created attachment 45386 [details] > aarch64-llvm output with -Ofast -mcpu=cortex-a57 > > I'm attaching the full LLVM aarch64 output. > > The output you quoted is w

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #2 from ktkachov at gcc dot gnu.org --- Created attachment 45386 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45386&action=edit aarch64-llvm output with -Ofast -mcpu=cortex-a57 I'm attaching the full LLVM aarch64 output. The

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #1 from Richard Biener --- So LLVM unrolls 4 times while GCC (always) unrolls 8 times. The unrolled body for GCC (x86_64 this time) is .L4: movl (%rdx), %ecx vmovsd (%rax), %xmm8 addq $32, %rdx