https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #43 from Jiu Fu Guo ---
Author: guojiufu
Date: Mon Nov 11 06:30:38 2019
New Revision: 278034
URL: https://gcc.gnu.org/viewcvs?rev=278034&root=gcc&view=rev
Log:
rs6000: Refine small loop unroll in loop_unroll_adjust hook
In this pat
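A minimal standalone sketch of the decision such a loop_unroll_adjust hook makes (illustrative thresholds and function shape, not the committed rs6000 code; returning 0 here models "do not unroll"):

/* Model of a loop_unroll_adjust-style policy: given the factor the
   generic unroller picked (nunroll) and the size of the loop body in
   insns, cap or veto the unroll.  Thresholds are illustrative.  */
unsigned
adjust_unroll_factor (unsigned nunroll, unsigned loop_ninsns)
{
  /* Small body: allow a modest unroll, capped at 2x for -O2.  */
  if (loop_ninsns <= 10)
    return nunroll < 2 ? nunroll : 2;

  /* Larger body: leave it alone at -O2.  */
  return 0;
}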
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #42 from Jiu Fu Guo ---
Author: guojiufu
Date: Mon Oct 28 05:23:24 2019
New Revision: 277501
URL: https://gcc.gnu.org/viewcvs?rev=277501&root=gcc&view=rev
Log:
rs6000: Enable limited unrolling at -O2
In PR88760, there are a few diss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #41 from Jiu Fu Guo ---
for code:
subroutine foo (i, i1, block)
  integer :: i, i1
  integer :: block(9, 9, 9)
  block(i:9,1,i1) = block(i:9,1,i1) - 10
end subroutine foo
"-funroll-loops --param max-unroll-times=2 --param
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #40 from rguenther at suse dot de ---
On Sat, 12 Oct 2019, guojiufu at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #39 from Jiu Fu Guo ---
> For small loops (1-2 stmts), in GIMPLE and RTL form, it would be around 5-10
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #39 from Jiu Fu Guo ---
For small loops (1-2 stmts), in GIMPLE and RTL form, it would be around 5-10
instructions: 2-4 insns per stmt, plus ~4 insns for the index.
With the current unroller, here is a statistic on SPEC2017.
Using --param max-un
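A loop of the kind being counted here, with the rough per-iteration budget from comment #39 in the comments (exact insn counts vary by target):

/* A 1-stmt loop: the statement costs roughly a load, an add and a
   store (~3 insns); the index overhead is an increment, a compare
   and a branch (~3-4 insns), so the body lands in the 5-10 range.  */
void
scale (int *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] += 10;     /* 1 stmt: load + add + store */
}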
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #38 from rguenther at suse dot de ---
On Fri, 11 Oct 2019, segher at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #37 from Segher Boessenkool ---
> -- If it is done in RTL it should really be done earlier; it doesn't get all
> optimisations it should right now.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #37 from Segher Boessenkool ---
-- If it is done in RTL it should really be done earlier; it doesn't get all
optimisations it should right now.
-- Unrolling small loops more aggressively (at -O2 even) perhaps needs to be
done at a di
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #36 from rguenther at suse dot de ---
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #32 from Wilco ---
> (In reply to Segher Boessenkool from comment #31)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #35 from rguenther at suse dot de ---
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #34 from Wilco ---
> (In reply to rguent...@suse.de from comment #30)
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #34 from Wilco ---
(In reply to rguent...@suse.de from comment #30)
> On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> >
> > --- Comment #29 from Wilco ---
> > (In repl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #33 from rguenther at suse dot de ---
On Fri, 11 Oct 2019, segher at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #31 from Segher Boessenkool ---
> Gimple passes know a lot about machine details, too.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #32 from Wilco ---
(In reply to Segher Boessenkool from comment #31)
> Gimple passes know a lot about machine details, too.
>
> Irrespective of whether this is "low-level" or "high-level", it should be done
> earlier than it is now. It should either be done right after expand, or
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #31 from Segher Boessenkool ---
Gimple passes know a lot about machine details, too.
Irrespective of whether this is "low-level" or "high-level", it should be done
earlier than it is now. It should either be done right after expand, or
s
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #30 from rguenther at suse dot de ---
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #29 from Wilco ---
> (In reply to Jiu Fu Guo from comment #28)
> > For
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #29 from Wilco ---
(In reply to Jiu Fu Guo from comment #28)
> For these kinds of small loops, it would be acceptable to unroll in GIMPLE,
> because register pressure and instruction cost may not be major concerns;
> just like "cunrol
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
Jiu Fu Guo changed:
           What           |Removed |Added
           CC             |        |guojiufu at gcc dot gnu.org
--- Comment #28
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
Wilco changed:
           What           |Removed |Added
           CC             |        |wdijkstr at arm dot com
--- Comment #27 from Wil
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #26 from Segher Boessenkool ---
Yeah, and it probably should be a param (that different targets can default
differently, per CPU probably). On most Power CPUs all loops take a minimum
number of cycles per iteration (say, three), but
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #25 from Richard Earnshaw ---
Well, very small loops should be unrolled more than slightly larger ones. So
perhaps if the loop body is only 3 or 4 instructions it should be unrolled four
times, but above that perhaps only twice.
Some
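That size-tiered policy can be sketched as a small decision function (the thresholds are the ones suggested in comment #25; the function shape is an assumption):

/* Illustrative sketch: pick an unroll factor from the body size.  */
unsigned
pick_unroll_factor (unsigned body_insns)
{
  if (body_insns <= 4)   /* very small body: unroll harder */
    return 4;
  return 2;              /* slightly larger loops: only 2x */
}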
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #24 from Segher Boessenkool ---
On some (many?) targets it would be good to unroll all loops with a small body
(not containing calls etc.) at -O2 already, some small number of times (2 or 4
maybe).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #23 from Wilco ---
(In reply to ktkachov from comment #22)
> helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%)
> improvement on parest, and about 5.3% on a more aggressive CPU.
> I tried unrolling 8x in a similar man
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #22 from ktkachov at gcc dot gnu.org ---
Some more experiments...
Unrolling 4x in a similar way to my previous example and not splitting the
accumulator (separate issue):
unsigned int *colnums;
double *val;
struct foostruct
{
unsig
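A sketch of that 4x manual unroll with a single (unsplit) accumulator; colnums and val are from the snippet above, while x, n and the function shape are assumed for illustration of the parest-style sparse row product:

/* 4x manual unroll, one accumulator: the adds stay serialized.  */
double
row_dot (const unsigned int *colnums, const double *val,
         const double *x, unsigned int n)
{
  double s = 0.0;
  unsigned int i = 0;

  /* Main 4x-unrolled body.  */
  for (; i + 4 <= n; i += 4)
    {
      s += val[i]     * x[colnums[i]];
      s += val[i + 1] * x[colnums[i + 1]];
      s += val[i + 2] * x[colnums[i + 2]];
      s += val[i + 3] * x[colnums[i + 3]];
    }

  /* Scalar epilogue for the remaining 0-3 elements.  */
  for (; i < n; i++)
    s += val[i] * x[colnums[i]];

  return s;
}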
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #21 from Wilco ---
(In reply to rguent...@suse.de from comment #20)
> On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> >
> > --- Comment #19 from Wilco ---
> > (In repl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #20 from rguenther at suse dot de ---
On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #19 from Wilco ---
> (In reply to rguent...@suse.de from comment #18)
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #19 from Wilco ---
(In reply to rguent...@suse.de from comment #18)
> > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > suggestion)
>
> If that helps, sure (I'd have guessed uarchs are going to split
> load
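A sketch of the load-pair idea under discussion: after a 2x unroll, two adjacent 8-byte loads can be merged by the target into one load-pair (ldp on AArch64). The function and names are illustrative:

/* 2x unroll exposing adjacent loads for pair formation.  */
double
sum_pairs (const double *val, unsigned int n)
{
  double s0 = 0.0, s1 = 0.0;
  unsigned int i = 0;

  for (; i + 2 <= n; i += 2)
    {
      s0 += val[i];       /* adjacent loads: a target can fuse */
      s1 += val[i + 1];   /* these into one 16-byte ldp        */
    }
  if (i < n)
    s0 += val[i];         /* odd remainder */

  return s0 + s1;
}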
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #18 from rguenther at suse dot de ---
On Thu, 24 Jan 2019, ktkachov at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #17 from ktkachov at gcc dot gnu.org ---
> I played around with the s
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #17 from ktkachov at gcc dot gnu.org ---
I played around with the source to do some conservative 2x manual unrolling in
the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA
loops). This was to try out Richard's
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #16 from Wilco ---
(In reply to rguent...@suse.de from comment #15)
> which is what I referred to for branch prediction. Your & prompts me
> to a way to do something similar to Duff's device, turning the loop into a nest.
>
> head:
>i
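A Duff's-device-style version of a 4x-unrolled reduction, turning the remainder handling into an entry point into the unrolled body rather than a separate prologue (illustrative; not the sketch that was cut off above):

/* The first pass through the unrolled body executes count % 4
   iterations; every later pass executes 4.  */
double
sum4 (const double *v, int count)
{
  double s = 0.0;
  int i = 0;
  int n = (count + 3) / 4;   /* passes through the unrolled body */

  if (count <= 0)
    return 0.0;

  switch (count % 4)
    {
    case 0: do { s += v[i++];   /* fall through */
    case 3:      s += v[i++];   /* fall through */
    case 2:      s += v[i++];   /* fall through */
    case 1:      s += v[i++];
               } while (--n > 0);
    }
  return s;
}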
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #15 from rguenther at suse dot de ---
On Thu, 17 Jan 2019, wilco at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #14 from Wilco ---
> (In reply to rguent...@suse.de from comment #13)
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #14 from Wilco ---
(In reply to rguent...@suse.de from comment #13)
> Usually the peeling is done to improve branch prediction on the
> prologue/epilogue.
Modern branch predictors do much better on a loop than with this kind of code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #13 from rguenther at suse dot de ---
On Wed, 16 Jan 2019, ktkachov at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #11 from ktkachov at gcc dot gnu.org ---
> Thank you all for the inpu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #12 from ktkachov at gcc dot gnu.org ---
(In reply to ktkachov from comment #11)
>
> As an experiment I hacked the AArch64 assembly of the function generated
> with -funroll-loops to replace the peeled prologue version with a simple
>
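The two shapes being compared, as illustrative C: GCC's unroller peels the remainder as straight-line prologue code, while the hand-edited assembly handled it with a plain loop after the unrolled body:

/* Shape A: peeled prologue, one conditional branch per peel.  */
double
sum_peeled (const double *v, unsigned int n)
{
  double s = 0.0;
  unsigned int i = 0;

  if (n & 1) { s += v[i]; i += 1; }
  if (n & 2) { s += v[i] + v[i + 1]; i += 2; }

  /* 4x-unrolled main body; n - i is now a multiple of 4.  */
  for (; i < n; i += 4)
    s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
  return s;
}

/* Shape B: unrolled body first, then a simple remainder loop,
   which modern branch predictors handle well (comment #14).  */
double
sum_loop_tail (const double *v, unsigned int n)
{
  double s = 0.0;
  unsigned int i = 0;

  for (; i + 4 <= n; i += 4)
    s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];

  for (; i < n; i++)   /* 0-3 leftovers */
    s += v[i];
  return s;
}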
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #11 from ktkachov at gcc dot gnu.org ---
Thank you all for the input.
Just to add a bit of data.
I've instrumented 510.parest_r to count the number of loop iterations to get a
feel for how much of the unrolled loop is spent in the act
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
rsandifo at gcc dot gnu.org changed:
           What            |Removed     |Added
           Status          |UNCONFIRMED |NEW
           Last reconfirmed|
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #9 from rguenther at suse dot de ---
On Tue, 15 Jan 2019, ktkachov at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #8 from ktkachov at gcc dot gnu.org ---
> btw, looks like ICC vectorises this as well as unrolling:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #8 from ktkachov at gcc dot gnu.org ---
btw, looks like ICC vectorises this as well as unrolling:
..B1.14:
        movl   (%rcx,%rbx,4), %r15d
        vmovsd (%rdi,%r15,8), %xmm2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #7 from Wilco ---
(In reply to rguent...@suse.de from comment #6)
> On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> >
> > --- Comment #5 from Wilco ---
> > (In reply to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #6 from rguenther at suse dot de ---
On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #5 from Wilco ---
> (In reply to Wilco from comment #4)
> > (In reply to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #5 from Wilco ---
(In reply to Wilco from comment #4)
> (In reply to ktkachov from comment #2)
> > Created attachment 45386 [details]
> > aarch64-llvm output with -Ofast -mcpu=cortex-a57
> >
> > I'm attaching the full LLVM aarch64 ou
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
Wilco changed:
           What           |Removed |Added
           CC             |        |wilco at gcc dot gnu.org
--- Comment #4 from Wil
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #3 from Richard Biener ---
(In reply to ktkachov from comment #2)
> Created attachment 45386 [details]
> aarch64-llvm output with -Ofast -mcpu=cortex-a57
>
> I'm attaching the full LLVM aarch64 output.
>
> The output you quoted is w
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #2 from ktkachov at gcc dot gnu.org ---
Created attachment 45386
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45386&action=edit
aarch64-llvm output with -Ofast -mcpu=cortex-a57
I'm attaching the full LLVM aarch64 output.
The
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #1 from Richard Biener ---
So LLVM unrolls 4 times while GCC (always) unrolls 8 times. The unrolled body
for GCC (x86_64 this time) is
.L4:
        movl   (%rdx), %ecx
        vmovsd (%rax), %xmm8
        addq   $32, %rdx