Re: Help understand the may_be_zero field in loop niter information
On Thu, Jun 12, 2014 at 7:59 PM, Zdenek Dvorak wrote:
> Hi,
>
>> > I noticed there is below code/comments about the may_be_zero field in
>> > the loop niter desc:
>> >
>> >   tree may_be_zero;  /* The boolean expression.  If it evaluates to true,
>> >                         the loop will exit in the first iteration (i.e.
>> >                         its latch will not be executed), even if the niter
>> >                         field says otherwise.  */
>> >
>> > I had difficulty in understanding this because I ran into some cases
>> > in which it didn't behave as said.
>
> actually, in all the examples below, the field behaves as described,
> i.e.,
>
>   the number of iterations = may_be_zero ? 0 : niter;
>
> In particular, the fact that may_be_zero is false *does not imply*
> that the number of iterations as described by niter is non-zero.
>
>> > Example 1, the dump of the loop before sccp is like:
>> >
>> >   :
>> >   bnd_4 = len_3(D) + 1;
>> >
>> >   :
>> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
>> >   _6 = ivtmp_1 + len_3(D);
>> >   _7 = a[ivtmp_1];
>> >   _8 = b[ivtmp_1];
>> >   _9 = _7 + _8;
>> >   a[_6] = _9;
>> >   ivtmp_11 = ivtmp_1 + 1;
>> >   if (bnd_4 > ivtmp_11)
>> >     goto ;
>> >   else
>> >     goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> > The loop niter information analyzed in sccp is like:
>> >
>> >   Analyzing # of iterations of loop 1
>> >     exit condition [1, + , 1] < len_3(D) + 1
>> >     bounds on difference of bases: -1 ... 4294967294
>> >   result:
>> >     zero if len_3(D) == 4294967295
>> >     # of iterations len_3(D), bounded by 4294967294
>> >
>> > Question 1: shouldn't it be like "len_3 + 1 <= 1", because the latch
>> > won't be executed when "len_3 == 0", right?
>
> the analysis determines the number of iterations as len_3, that is
> 0 if len_3 == 0.  So, the information is computed correctly here.
>
>> > But when the boundary condition is the only case in which the latch
>> > gets executed ZERO times, the may_be_zero info will not be computed.
>> > See Example 2, with the dump of the loop before sccp like:
>> >
>> > foo (int M)
>> >
>> >   :
>> >   if (M_4(D) > 0)
>> >     goto ;
>> >   else
>> >     goto ;
>> >
>> >   :
>> >   return;
>> >
>> >   :
>> >
>> >   :
>> >   # i_13 = PHI <0(4), i_10(6)>
>> >   _5 = i_13 + M_4(D);
>> >   _6 = a[i_13];
>> >   _7 = b[i_13];
>> >   _8 = _6 + _7;
>> >   a[_5] = _8;
>> >   i_10 = i_13 + 1;
>> >   if (M_4(D) > i_10)
>> >     goto ;
>> >   else
>> >     goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> > The niter information analyzed in sccp is like:
>> >
>> >   Analyzing # of iterations of loop 1
>> >     exit condition [1, + , 1](no_overflow) < M_4(D)
>> >     bounds on difference of bases: 0 ... 2147483646
>> >   result:
>> >     # of iterations (unsigned int) M_4(D) + 4294967295, bounded by 2147483646
>> >
>> > So may_be_zero is always false here, but the latch may be executed
>> > ZERO times when "M_4 == 1".
>
> Again, this is correct, since then ((unsigned int) M_4) + 4294967295 == 0.
>
>> > Starting from Example 1, we can create Example 3, which makes no sense
>> > to me.  Again, the dump of the loop is like:
>> >
>> >   :
>> >   bnd_4 = len_3(D) + 1;
>> >
>> >   :
>> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
>> >   _6 = ivtmp_1 + len_3(D);
>> >   _7 = a[ivtmp_1];
>> >   _8 = b[ivtmp_1];
>> >   _9 = _7 + _8;
>> >   a[_6] = _9;
>> >   ivtmp_11 = ivtmp_1 + 4;
>> >   if (bnd_4 > ivtmp_11)
>> >     goto ;
>> >   else
>> >     goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> >   :
>> >   return 0;
>> >
>> > The niter info is like:
>> >
>> >   Analyzing # of iterations of loop 1
>> >     exit condition [4, + , 4] < len_3(D) + 1
>> >     bounds on difference of bases: -4 ... 4294967291
>> >   result:
>> >     under assumptions len_3(D) + 1 <= 4294967292
>> >     zero if len_3(D) == 4294967295
>> >     # of iterations len_3(D) / 4, bounded by 1073741823
>> >
>> > The problem is: won't the latch be executed ZERO times when
>> > "len_3 == 0/1/2/3"?
>
> Again, in all these cases the number of iterations is len_3 / 4 == 0.
> Zdenek

Hi Zdenek,

I spent some more time pondering over this and I think I understand the (at least one) motivation for why may_be_zero acts as it does now.  At least for IVOPTs, the boundary condition under which the loop latch is not executed doesn't need to be handled specially when trying to eliminate condition iv uses.

So I am thinking whether it's OK for me to send a documentation patch describing how this works, since it was a little bit confusing to me at first glance.

Thanks,
bin

-- 
Best Regards.
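Zdenek's invariant, namely that the number of latch executions equals may_be_zero ? 0 : niter, can be checked with a small standalone C sketch of Examples 1 and 3 (the array bodies do not affect the trip count and are omitted; the function names below are invented for this illustration, they are not GCC APIs):

```c
#include <assert.h>
#include <limits.h>

/* Example 1: exit test is bnd > iv + 1 with bnd = len + 1, step 1.
   Returns how many times the latch (the back edge) executes.  */
static unsigned
latch_count_ex1 (unsigned len)
{
  unsigned bnd = len + 1;       /* wraps to 0 when len == UINT_MAX */
  unsigned iv = 0, latches = 0;
  for (;;)
    {
      unsigned next = iv + 1;   /* ivtmp_11 = ivtmp_1 + 1 */
      if (!(bnd > next))
        break;
      latches++;                /* back edge taken: latch executed */
      iv = next;
    }
  return latches;
}

/* Example 3: same loop but with step 4, valid under the dump's
   assumption len + 1 <= 4294967292.  */
static unsigned
latch_count_ex3 (unsigned len)
{
  unsigned bnd = len + 1;
  unsigned iv = 0, latches = 0;
  for (;;)
    {
      unsigned next = iv + 4;
      if (!(bnd > next))
        break;
      latches++;
      iv = next;
    }
  return latches;
}

/* The formulas from the niter dumps: may_be_zero ? 0 : niter.  */
static unsigned
niter_ex1 (unsigned len)
{
  return len == UINT_MAX ? 0 : len;       /* zero if len_3 == 4294967295 */
}

static unsigned
niter_ex3 (unsigned len)
{
  return len == UINT_MAX ? 0 : len / 4;   /* # of iterations len_3 / 4 */
}
```

For len in 0..3, Example 3's latch runs zero times even though may_be_zero is false, matching Zdenek's point that niter itself (len / 4) already evaluates to 0 there.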
Re: [PATCH] tell gcc optimizer to never introduce new data races
Adding "--param allow-store-data-races=0" to the GCC options for the kernel breaks C=1 because Sparse isn't expecting a GCC option with that format.  It thinks allow-store-data-races=0 is the name of the file we are trying to test.

Try using Sparse on linux-next to see the problem.

  $ make C=2 mm/slab_common.o
    CHK     include/config/kernel.release
    CHK     include/generated/uapi/linux/version.h
    CHK     include/generated/utsrelease.h
    CALL    scripts/checksyscalls.sh
    CHECK   scripts/mod/empty.c
  No such file: allow-store-data-races=0
  make[2]: *** [scripts/mod/empty.o] Error 1
  make[1]: *** [scripts/mod] Error 2
  make: *** [scripts] Error 2
  $

regards,
dan carpenter
Re: [PATCH] tell gcc optimizer to never introduce new data races
Dan Carpenter writes:

> Adding "--param allow-store-data-races=0" to the GCC options for the
> kernel breaks C=1 because Sparse isn't expecting a GCC option with that
> format.

Please try --param=allow-store-data-races=0 instead.

Andreas.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."
Re: [GSoC] decision tree first steps
On Mon, Jun 16, 2014 at 1:07 AM, Prathamesh Kulkarni wrote:
> On Sat, Jun 14, 2014 at 12:43 PM, Richard Biener wrote:
>
> I have attached a patch that tries to implement the decision tree using
> the above algorithm.  (I haven't done it for built-in functions yet, but
> that would be similar to expr, so I guess no new issues may come up for
> that.)

Great.

> * AST representation
> Added two more classes to the AST - true_operand and match_operand - to
> represent "true" and "match" operands respectively.  Captures are built
> during parsing, and are "lowered" to either true_operand or
> match_operand while inserting AST operands in the decision tree
> (lower_capture).

Hmm, ok.  I'd have made them decision tree node classes instead, but
it's a matter of taste I guess.

+  // or maybe keep a parallel bool indexes_empty array instead of using capture_max to denote "not seen" ?
+  for (unsigned i = 0; i < capture_max; ++i)
+    indexes[i] = level_max;

using a special value is fine.

> * Mapping capture index to preorder level
> dt_simplify::indexes (unsigned *indexes) provides the mapping from
> capture index -> level.
> eg: indexes[1] = 2 represents that @1 is at level 2 in the preorder
> traversal of the AST.
>
> * true_operand is always placed as the last child of the decision tree
> node during insertion (dt_node::append_node), since we want to process
> it last (if all other decisions fail).

right.

> * Code gen
> Unfortunately, the patch still contains hacks for code-gen.
> One such hack is adding three fields - (parent, preorder_level, pos) -
> to operand.  They should really be part of the decision tree, but since
> code-gen happens off the AST, I needed to place them there.
> For removing that, I am thinking to put the information required for
> code-gen in another struct (say operand_info?):
>
>   struct operand_info
>   {
>     operand *op;
>     unsigned pos;
>     operand *parent;
>     unsigned preorder_level;
>   };

Eventually you can just pass the info to the code-generators as extra
arguments?
That is, I would get rid of the AST methods for generating the matching
code and just do everything in the DT traversal.  That is, find a better
abstraction here.

> a) The metadata of the operand (pos, parent, preorder_level) can be
> computed during preorder traversal in walk_operand_preorder.
> b) Stick operand_info into the decision tree (dt_operand) instead of
> operand.
> Is that fine?

Yes, that would work, but as code-gen off the DT should be quite simple
I'd rather not complicate things with too much C++ abstraction (yeah,
it's probably my fault to introduce it in the first place).

> Code-gen for operands is slightly changed.
> The temporary is created at the expression's operand node rather than
> at the expression's node itself.  Each operand knows its name.
>
> Its name is computed as follows (dt_operand::gen_gimple):
>   opname = op  (if the operand's parent is root)
>   or opname = o  if the operand's parent is true_operand or match_operand
>   or opname = gimple_assign_rhs (def_stmt parent node>);  // if the
>     operand's parent is a non-root expr
> for built-in functions it would be:
>   or opname = gimple_call_arg (def_stmt, );

Hmm, in code-gen I see

  if (code == MINUS_EXPR)
    {
      {
        tree o1 = op0;
        if (TREE_CODE (o1) == SSA_NAME)
          {
            gimple def_stmt1 = SSA_NAME_DEF_STMT (o1);
            if (is_gimple_assign (def_stmt1)
                && gimple_assign_rhs_code (def_stmt1) == PLUS_EXPR)
              {
                ...
              }
          }
      }
      {
        tree o1 = op0;
        if (TREE_CODE (o1) == SSA_NAME)
          {
            gimple def_stmt1 = SSA_NAME_DEF_STMT (o1);
            if (is_gimple_assign (def_stmt1)
                && gimple_assign_rhs_code (def_stmt1) == MINUS_EXPR)
              {
                ...

for the DT part

  root, 2
  |--operand: MINUS_EXPR, 2
     |--operand: PLUS_EXPR, 1
        ...
     |--operand: MINUS_EXPR, 1
        ...

but I would have expected the preamble for the inner PLUS_EXPR/MINUS_EXPR
check to be unified.  Thus

  if (code == MINUS_EXPR)
    {
      tree o1 = op0;
      if (TREE_CODE (o1) == SSA_NAME)
        {
          gimple def_stmt1 = SSA_NAME_DEF_STMT (o1);
          if (is_gimple_assign (def_stmt1))
            {
              if (gimple_assign_rhs_code (def_stmt1) == PLUS_EXPR)
                {
                  ...
                }
              else if (gimple_assign_rhs_code (def_stmt1) == MINUS_EXPR)
                {
                  ...
                }

That means a better factoring of code-generation would be necessary,
with possibly sorting the kids array after operand kind.

> * Added do_valueize () in gimple-match-head.c.  The generated code
> calls do_valueize to valueize the operand.  This makes code-gen
> simpler (no goto).

Good.

> Example:
> for the pattern:
>
>   (match_and_simplify
>     (minus (plus @0 @1) @1)
>     @0)
>
> it produces the following code (literally taken from gimple-match.c
> after running it through indent):
> http://pastebin.com/EaFHZMAF
>
> For non-matching captures (capt->what->type == operand::OP_EXPR), I
> tested with a few bog
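The pattern under discussion, (minus (plus @0 @1) @1) -> @0, can be illustrated on a toy expression IR.  This is NOT GCC's gimple API: the struct and all names below are invented for the sketch.  The point is only the shape of the factored matcher Richard asks for, i.e. inspect the inner operand once, then dispatch on its code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy expression IR, invented for this sketch (not GCC's).  */
enum toy_code { TOY_VAR, TOY_PLUS, TOY_MINUS };

struct toy_expr
{
  enum toy_code code;
  struct toy_expr *op0, *op1;   /* NULL for TOY_VAR */
};

/* Match (minus (plus @0 @1) @1) and simplify to @0; otherwise return
   the expression unchanged.  The inner-operand inspection is the
   shared preamble; further patterns on op0 would branch off the same
   dispatch rather than repeating the lookup.  */
static struct toy_expr *
toy_simplify (struct toy_expr *e)
{
  if (e->code != TOY_MINUS)
    return e;
  struct toy_expr *o0 = e->op0;      /* shared preamble: fetch op0 once */
  if (o0->code == TOY_PLUS)          /* dispatch on the inner code */
    {
      if (o0->op1 == e->op1)         /* capture @1 must match */
        return o0->op0;              /* simplified result: @0 */
      /* A commutated variant, (minus (plus @1 @0) @1), would be an
         additional arm here; not handled in this sketch.  */
    }
  return e;
}
```

In a generated matcher, each else-if arm on o0->code corresponds to one child of the decision tree node, which is why sorting the kids array by operand kind makes the factoring fall out naturally.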
Re: [PATCH] tell gcc optimizer to never introduce new data races
On Mon, 16 Jun 2014, Andreas Schwab wrote:

>> Adding "--param allow-store-data-races=0" to the GCC options for the
>> kernel breaks C=1 because Sparse isn't expecting a GCC option with
>> that format.
>
> Please try --param=allow-store-data-races=0 instead.

How reliable is this format across GCC versions?  The GCC manpage
doesn't seem to list it as a valid alternative.

-- 
Jiri Kosina
SUSE Labs
vector load Rematerialization!!
Hello All:

There has been work done for load rematerialization: instead of storing and loading variables, they are kept in registers for their live range.  Till now we are doing rematerialization only for scalar loads.  Is it feasible to have rematerialization for vector loads?  This would help reduce the vectorized stores and loads for the dependencies across vectorized loops.

I was looking at a presentation which mentioned that load rematerialization is implemented from GCC 4.8.2 onwards.  Does this implementation take care of rematerialization of vector loads?  Can we have this approach?

Please let me know what you think.

Thanks & Regards
Ajit
Register Pressure guided Unroll and Jam in GCC !!
Hello All:

I have worked on the Open64 compiler, where register-pressure-guided unroll and jam gave a good amount of performance improvement for the C and C++ SPEC benchmarks and also Fortran benchmarks.

Unroll and jam increases the register pressure in the unrolled loop, leading to an increase in spills and fetches that degrades the performance of the unrolled loop.  The cache-locality benefit achieved through unroll and jam is degraded by the presence of spill instructions due to the increased register pressure.  It is better to base the decision on the unroll factor of the loop on a performance model of the register pressure.

Most loop optimizations, like unroll and jam, are implemented in the high-level IR.  Register-pressure-based unroll and jam requires the calculation of register pressure in the high-level IR, which would be similar to the register pressure we calculate during register allocation.  This makes the implementation complex.

To overcome this, the Open64 compiler makes the unrolling decision both in the high-level IR and at the code generation level, deferring some of the decisions to the end of code generation.  The advantage of this approach is that it can use the register pressure information calculated by the register allocator, which makes the implementation much simpler and less complex.

Can we have this approach in GCC: making the unroll-and-jam decisions in the high-level IR and also deferring some of the decisions to the code generation level, like Open64?

Please let me know what you think.

Thanks & Regards
Ajit
Re: [GSoC] decision tree first steps
Hi,

On Mon, 16 Jun 2014, Richard Biener wrote:

> For
>
>   (match_and_simplify
>     (MINUS_EXPR @2 (PLUS_EXPR@2 @0 @1))
>     @1)

Btw, this just triggered my eye.  So with lumping the predicate to the capture without special separator syntax, it means that there's a difference between "minus_expr @2" and "minus_expr@2" with a meaningful whitespace (despite 'r' and '@' already being a natural word boundary), which seems less than ideal.  Just mentioning :)

Ciao,
Michael.
Re: Register Pressure guided Unroll and Jam in GCC !!
On Mon, Jun 16, 2014 at 4:14 PM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>
> Please let me know what do you think.

Sure, you can for example compute validity of the transform during the GIMPLE loop opts, annotate the loop meta-information with the desired transform and apply it (or not) later during RTL unrolling.

Richard.

> Thanks & Regards
> Ajit
Re: [PATCH] tell gcc optimizer to never introduce new data races
On Mon, Jun 16, 2014 at 12:52:10PM +0200, Andreas Schwab wrote:
> Dan Carpenter writes:
>> Adding "--param allow-store-data-races=0" to the GCC options for the
>> kernel breaks C=1 because Sparse isn't expecting a GCC option with
>> that format.
>
> Please try --param=allow-store-data-races=0 instead.

That appears to work for me.
RE: Register Pressure guided Unroll and Jam in GCC !!
-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Monday, June 16, 2014 7:55 PM
To: Ajit Kumar Agarwal
Cc: gcc@gcc.gnu.org; Vladimir Makarov; Michael Eager; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Register Pressure guided Unroll and Jam in GCC !!

On Mon, Jun 16, 2014 at 4:14 PM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
> Please let me know what do you think.

>> Sure, you can for example compute validity of the transform during the GIMPLE loop opts, annotate the loop meta-information with the desired transform and apply it (or not) later during RTL unrolling.

Thanks!  Has RTL unrolling already been implemented?

>> Richard.

> Thanks & Regards
> Ajit
RE: Register Pressure guided Unroll and Jam in GCC !!
On June 16, 2014 6:39:58 PM CEST, Ajit Kumar Agarwal wrote:
>
> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Monday, June 16, 2014 7:55 PM
> To: Ajit Kumar Agarwal
> Cc: gcc@gcc.gnu.org; Vladimir Makarov; Michael Eager; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Register Pressure guided Unroll and Jam in GCC !!
>
> On Mon, Jun 16, 2014 at 4:14 PM, Ajit Kumar Agarwal wrote:
>> Hello All:
>>
>> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>>
>> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>>
>> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>>
>> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>>
>> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>>
>> Please let me know what do you think.
>
>>> Sure, you can for example compute validity of the transform during the GIMPLE loop opts, annotate the loop meta-information with the desired transform and apply it (or not) later during RTL unrolling.
>
> Thanks!  Has RTL unrolling already been implemented?

Yes, but not of non-innermost loops afaik.

Richard

> Richard.
>
>> Thanks & Regards
>> Ajit
Re: Register Pressure guided Unroll and Jam in GCC !!
On Mon, 2014-06-16 at 14:14 +, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>
> Please let me know what do you think.

I have been working on calculating something analogous to register pressure using a count of the number of live SSA values during the ipa-inline pass.
I've been working on steering inlining (especially in LTO) away from decisions that explode the register pressure downstream, with a similar goal of avoiding situations that cause a lot of spill code.

I have been working in a branch if you want to take a look:
gcc/branches/lto-pressure

Aaron

> Thanks & Regards
> Ajit

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: Register Pressure guided Unroll and Jam in GCC !!
On 2014-06-16, 10:14 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>
> Please let me know what do you think.

Most loop optimizations are a good target for register-pressure-sensitive algorithms, as loops are usually program hot spots and any pressure increase there would be harmful, as no RA can undo such complex transformations.  So I guess your proposal could work.
Right now we have only pressure-sensitive modulo scheduling (SMS) and loop-invariant motion (as I remember, switching loop-invariant motion from a very inaccurate register-pressure evaluation to one based on the RA's pressure evaluation gave a nice improvement of about 1% for SPECFP2000 on some targets).
Re: Register Pressure guided Unroll and Jam in GCC !!
On 2014-06-16, 2:25 PM, Aaron Sawdey wrote:
> On Mon, 2014-06-16 at 14:14 +, Ajit Kumar Agarwal wrote:
>> Hello All:
>>
>> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>>
>> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>>
>> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>>
>> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>>
>> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>>
>> Please let me know what do you think.
>
> I have been working on calculating something analogous to register pressure using a count of the number of live SSA values during the ipa-inline pass.
> I've been working on steering inlining (especially in LTO) away from decisions that explode the register pressure downstream, with a similar goal of avoiding situations that cause a lot of spill code.
>
> I have been working in a branch if you want to take a look:
> gcc/branches/lto-pressure

Any pressure evaluation is better than its absence.  But on this level it is hard to evaluate it accurately.

E.g. pressure in a loop can be high for general regs, for fp regs, or both.  Using live SSA values is still very inaccurate for making the right decision for the transformations.
Re: Register Pressure guided Unroll and Jam in GCC !!
On Mon, 2014-06-16 at 14:42 -0400, Vladimir Makarov wrote:
> On 2014-06-16, 2:25 PM, Aaron Sawdey wrote:
>> On Mon, 2014-06-16 at 14:14 +, Ajit Kumar Agarwal wrote:
>>> Hello All:
>>>
>>> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>>>
>>> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>>>
>>> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>>>
>>> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>>>
>>> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>>>
>>> Please let me know what do you think.
>>
>> I have been working on calculating something analogous to register pressure using a count of the number of live SSA values during the ipa-inline pass.  I've been working on steering inlining (especially in LTO) away from decisions that explode the register pressure downstream, with a similar goal of avoiding situations that cause a lot of spill code.
>>
>> I have been working in a branch if you want to take a look:
>> gcc/branches/lto-pressure
>
> Any pressure evaluation is better than its absence.  But on this level it is hard to evaluate it accurately.
>
> E.g. pressure in a loop can be high for general regs, for fp regs, or both.  Using live SSA values is still very inaccurate for making the right decision for the transformations.

Yes, the jump I have not made yet is to classify the pressure by which register class the values might end up in.  The other big piece that's potentially missing at that point is pressure caused by temps and by scheduling.  But I think you can still get order-of-magnitude type estimates.

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
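A crude version of the live-value count Aaron describes can be sketched as follows.  This is a toy model, assuming each value has one linear live range from its defining statement to its last use; real SSA liveness over a CFG is more involved, and, as Vladimir notes, this does not classify values by register class:

```c
#include <stddef.h>

/* Toy pressure proxy: value v is live over statements
   def[v] .. last_use[v] inclusive.  Return the maximum number of
   simultaneously live values over all nstmts statements.  */
static unsigned
max_live (const unsigned *def, const unsigned *last_use,
          size_t nvals, unsigned nstmts)
{
  unsigned max = 0;
  for (unsigned s = 0; s < nstmts; s++)
    {
      unsigned live = 0;
      for (size_t v = 0; v < nvals; v++)
        if (def[v] <= s && s <= last_use[v])
          live++;                 /* value v is live at statement s */
      if (live > max)
        max = live;
    }
  return max;
}
```

Comparing this maximum against the number of available hard registers gives the kind of order-of-magnitude estimate discussed above, without running the register allocator.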
[GSoC] Status - 20140616
Hi Community,

We are 1 week away from midterm evaluations of students' work.  Mentors, please start looking closely into your student's progress and draft up evaluation notes.

Midterm evaluations are very important in GSoC.  Students who fail this evaluation are immediately kicked out of the program.  Students who pass get their midterm payment ($2250).

Both mentors and students will need to submit midterm evaluations between June 23-27.  There is no excuse for not submitting your evaluations.  Please let me know if you have any problems submitting your evaluation in the period June 23-27.

For evaluations, you might find this guide helpful: http://en.flossmanuals.net/GSoCMentoring/evaluations/

On another note, copyright assignments are now completed for 4 out of 5 students.  I have pinged the last student to get his assignment in order.

-- 
Maxim Kuvyrkov
www.linaro.org