RE: regs_used estimation in IVOPTS seriously flawed

2014-06-20 Thread Bingfeng Mei


> -----Original Message-----
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: 20 June 2014 06:25
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: regs_used estimation in IVOPTS seriously flawed
> 
> On Tue, Jun 17, 2014 at 10:59 PM, Bingfeng Mei wrote:
> > Hi,
> > I am looking at a performance regression in our code. A big loop produces
> > and uses a lot of temporary variables inside the loop body. The problem
> > appears that IVOPTS pass creates even more induction variables (from
> > original 2 to 27). It causes a lot of register spilling later and performance
> Do you have a simplified case which can be posted here?  I guess it
> affects some other targets too.
> 
> > take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does call
> > estimate_reg_pressure_cost function to take # of registers into
> > consideration. The second parameter passed as data->regs_used is supposed
> > to represent old register usage before IVOPTS.
> >
> >   return size + estimate_reg_pressure_cost (size, data->regs_used,
> >                                             data->speed,
> >                                             data->body_includes_call);
> >
> > In this case, it is mere 2 by following calculation. Essentially, it only
> > counts all loop invariant registers, ignoring all registers produced/used
> > inside the loop.
> There are two kinds of registers produced/used inside the loop.  One
> is induction variable irrelevant, it includes non-linear uses as
> mentioned by Richard.  The other kind relates to induction variable
> rewrite, and one issue with this kind is expression generated during
> iv use rewriting is not reflecting the estimated one in ivopt very
> well.
> 

As a short term solution, I tried some simple non-linear functions, as Richard
suggested, to penalize using too many IVs. For example, the following cost in
ivopts_global_cost_for_size fixed my regression and actually improves
performance slightly over a set of benchmarks we usually use.

  return size * (1 + size * 0.2)
         + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
                                       data->body_includes_call);

The trouble is that the choice of this non-linear function could be highly
target dependent (# of registers?). I don't have a setup to prove a performance
gain on other targets.
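
To illustrate why a non-linear term can matter when regs_used is
underestimated, here is a minimal, self-contained sketch. It is not GCC's
code: toy_reg_pressure_cost is only a simplified stand-in for
estimate_reg_pressure_cost, the TOY_* constants are made-up target parameters,
and the 0.2 factor is applied in integer arithmetic as size * size / 5.

```c
#include <stdbool.h>

/* Hypothetical target parameters, chosen only for illustration.  */
#define TOY_AVAIL_REGS 16  /* registers usable inside the loop */
#define TOY_RES_REGS    3  /* registers kept in reserve */
#define TOY_REG_COST    1  /* cost per new IV when registers are tight */
#define TOY_SPILL_COST  8  /* cost per new IV once spills are expected */

/* Simplified stand-in for estimate_reg_pressure_cost (an assumption,
   not the real function): free while registers are plentiful, cheap
   near the limit, expensive past it.  */
static unsigned
toy_reg_pressure_cost (unsigned n_new, unsigned n_old)
{
  unsigned needed = n_new + n_old;

  if (needed + TOY_RES_REGS <= TOY_AVAIL_REGS)
    return 0;
  if (needed <= TOY_AVAIL_REGS)
    return TOY_REG_COST * n_new;
  return TOY_SPILL_COST * n_new;
}

/* Linear form: size + pressure cost.  */
static unsigned
toy_linear_cost (unsigned size, unsigned regs_used)
{
  return size + toy_reg_pressure_cost (size, regs_used);
}

/* Non-linear form: size * (1 + size * 0.2) + pressure cost, with the
   quadratic part computed as size * size / 5 (truncating).  */
static unsigned
toy_nonlinear_cost (unsigned size, unsigned regs_used)
{
  return size + size * size / 5 + toy_reg_pressure_cost (size, regs_used);
}
```

In this toy model, with regs_used = 2 the pressure term stays at zero for up to
11 candidates, so 10 IVs cost only 10 under the linear form, while the
non-linear form already charges 30.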

I also tried counting all SSA names and divide it by a factor. It does seem to
work so well.

Long term, if we had infrastructure to analyze the maximum number of live
variables in a loop at tree level, that would be great for many loop
optimizations.
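
As a rough sketch of what such a "maximum live" number means, the following
counts how many values are live at once in a straight-line body. It assumes
each value is described by a hypothetical (def, last_use) pair of statement
indices, and it ignores control flow and loop-carried values, so it is only an
illustration, not a proposal for the actual analysis.

```c
#include <stddef.h>

/* One value in a straight-line loop body: the index of the statement
   that defines it and the index of its last use (def <= last_use).
   This representation is an assumption made for the example.  */
struct toy_range
{
  unsigned def;
  unsigned last_use;
};

/* Return the largest number of values that are simultaneously live
   just after any statement 0 .. n_stmts - 1.  A value is live after
   statement S if it is defined at or before S and still used later.  */
static unsigned
toy_max_live (const struct toy_range *r, size_t n, unsigned n_stmts)
{
  unsigned max_live = 0;

  for (unsigned s = 0; s < n_stmts; s++)
    {
      unsigned live = 0;

      for (size_t i = 0; i < n; i++)
        if (r[i].def <= s && s < r[i].last_use)
          live++;

      if (live > max_live)
        max_live = live;
    }
  return max_live;
}
```

The quadratic scan keeps the example short; a real implementation would more
likely sort the range endpoints or reuse existing liveness data.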

Thanks,
Bingfeng


Re: regs_used estimation in IVOPTS seriously flawed

2014-06-20 Thread Bin.Cheng
On Fri, Jun 20, 2014 at 5:01 PM, Bingfeng Mei  wrote:
> As a short term solution, I tried some simple non-linear functions as
> Richard suggested

Oh, I misread the non-linear way as non-linear iv uses.

> to penalize using too many IVs. For example, the following cost in
> ivopts_global_cost_for_size fixed my regression and actually improves
> performance slightly over a set of benchmarks we usually use.

Great, I will try to tweak it on ARM.

>
>   return size * (1 + size * 0.2)
>          + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
>                                        data->body_includes_call);
>
> The trouble is choice of this non-linear function could be highly target
> dependent (# of registers?). I don't have setup to prove performance gain
> for other targets.
>
> I also tried counting all SSA names and divide it by a factor. It does seem
> to work

So the number currently computed is only a lower bound, and it is too
small.  Maybe it's possible to raise it with some relatively cheap
analysis, while on the other hand not restricting IVOPTS for loops
with low register pressure.

Thanks,
bin

> so well.
>
> Long term, if we have infrastructure to analyze maximal live variable in a
> loop at tree-level, that would be great for many loop optimizations.
>
> Thanks,
> Bingfeng



-- 
Best Regards.


RE: regs_used estimation in IVOPTS seriously flawed

2014-06-20 Thread Bingfeng Mei
Sorry, typo in previous mail. 

"I also tried counting all SSA names and divide it by a factor. It does
NOT seem to work so well"

> -----Original Message-----
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: 20 June 2014 10:19
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: regs_used estimation in IVOPTS seriously flawed
> 
> On Fri, Jun 20, 2014 at 5:01 PM, Bingfeng Mei  wrote:
> > As a short term solution, I tried some simple non-linear functions as
> > Richard suggested
> 
> Oh, I misread the non-linear way as non-linear iv uses.
> 
> > to penalize using too many IVs. For example, the following cost in
> > ivopts_global_cost_for_size fixed my regression and actually improves
> > performance slightly over a set of benchmarks we usually use.
> 
> Great, I will try to tweak it on ARM.
> 
> >
> >   return size * (1 + size * 0.2)
> >          + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
> >                                        data->body_includes_call);
> >
> > The trouble is choice of this non-linear function could be highly target
> > dependent (# of registers?). I don't have setup to prove performance gain
> > for other targets.
> >
> > I also tried counting all SSA names and divide it by a factor. It does
> > seem to work
> 
> So the number currently computed is the lower bound which is too
> small.  Maybe it's possible to do some analysis with relatively low
> cost increasing the number somehow.  While on the other hand, doesn't
> bring restriction to IVOPT for loops with low register pressure.
> 
> Thanks,
> bin
> 
> > so well.
> >
> > Long term, if we have infrastructure to analyze maximal live variable in
> > a loop at tree-level, that would be great for many loop optimizations.
> >
> > Thanks,
> > Bingfeng
> 
> 
> 
> --
> Best Regards.


Re: regs_used estimation in IVOPTS seriously flawed

2014-06-20 Thread David Edelsohn
On Fri, Jun 20, 2014 at 5:01 AM, Bingfeng Mei  wrote:

> As a short term solution, I tried some simple non-linear functions as
> Richard suggested to penalize using too many IVs. For example, the
> following cost in ivopts_global_cost_for_size fixed my regression and
> actually improves performance slightly over a set of benchmarks we
> usually use.
>
>   return size * (1 + size * 0.2)
>          + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
>                                        data->body_includes_call);
>
> The trouble is choice of this non-linear function could be highly target
> dependent (# of registers?). I don't have setup to prove performance gain
> for other targets.
>
> I also tried counting all SSA names and divide it by a factor. It does seem
> to work so well.
>
> Long term, if we have infrastructure to analyze maximal live variable in a
> loop at tree-level, that would be great for many loop optimizations.

I assume that you are going to parameterize the scaling so that it can
be tuned for each target.
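
One way the scaling could be parameterized, sketched under assumptions: the
function name and the scale_percent knob below are hypothetical and do not
correspond to an existing GCC parameter. A value of 20 reproduces the
size * (1 + size * 0.2) term from the patch (in truncating integer
arithmetic), and 0 falls back to the plain linear size term.

```c
/* Size part of the IV-set cost with a tunable quadratic penalty.
   SCALE_PERCENT is a hypothetical per-target knob: the extra cost per
   IV, expressed as a percentage of SIZE.  */
static unsigned
toy_scaled_size_cost (unsigned size, unsigned scale_percent)
{
  return size + size * size * scale_percent / 100;
}
```

A target with few registers might pick a larger percentage, and one with a
large register file a smaller one; the right values would have to come from
benchmarking on each target.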

Maybe Aaron's live range approximation can improve the estimate.

- David


Re: Offload Library

2014-06-20 Thread David Edelsohn
On Fri, May 16, 2014 at 7:47 AM, Kirill Yukhin  wrote:
> Dear steering committee,
>
> To support the offloading features for Intel's Xeon Phi cards
> we need to add a foreign library (liboffload) into the gcc repository.
> README with build instructions is attached.
>
> I am also copy-pasting the header comment from one of the liboffload files.
> The header shown below will be in all the source files in liboffload.
>
> Sources can be downloaded from [1].
>
> Additionally to that sources we going to add few headers (released under
> GPL v2.1 license) and couple of new sources (license in the bottom of the
> message).
>
> Does this look OK?

The GCC SC has decided to allow this library in the GCC sources.

If the library is not going to be expanded to support all GPUs and
offload targets, the library name should be more specific to Intel.

Thanks, David


Re: Offload Library

2014-06-20 Thread Joel Sherrill

On 6/20/2014 1:46 PM, David Edelsohn wrote:
> On Fri, May 16, 2014 at 7:47 AM, Kirill Yukhin wrote:
>> Dear steering committee,
>>
>> To support the offloading features for Intel's Xeon Phi cards
>> we need to add a foreign library (liboffload) into the gcc repository.
>> README with build instructions is attached.
>>
>> I am also copy-pasting the header comment from one of the liboffload files.
>> The header shown below will be in all the source files in liboffload.
>>
>> Sources can be downloaded from [1].
>>
>> Additionally to that sources we going to add few headers (released under
>> GPL v2.1 license) and couple of new sources (license in the bottom of the
>> message).
>>
>> Does this look OK?
> The GCC SC has decided to allow this library in the GCC sources.
>
> If the library is not going to be expanded to support all GPUs and
> offload targets, the library name should be more specific to Intel.
That matches what I understood and should have said Yes to.
> Thanks, David

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel.sherr...@oarcorp.com        On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985