Re: How to extend SLP to support this case
Hi Richi,

on 2020/3/10 7:12 PM, Richard Biener wrote:
> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
>>
>> Hi all,
>>
>> But how to teach it to be aware of this? Currently the processing starts
>> from bottom to up (from stores), can we do some analysis on the SLP
>> instance, detect some pattern and update the whole instance?
>
> In theory yes (Tamar had something like that for AARCH64 complex
> rotations IIRC). And yes, the issue boils down to how we handle
> SLP discovery. I'd like to improve SLP discovery but it's on my list
> only after I managed to get rid of the non-SLP code paths. I have
> played with some ideas (even produced hackish patches) to find
> "seeds" to form SLP groups from using multi-level hashing of stmts.
> My plan is to rewrite SLP discovery completely, starting from a
> SLP graph that 1:1 reflects the SSA use-def graph (no groups
> formed yet) and then form groups from seeds and insert
> "connection" nodes (merging two subgraphs with less lanes
> into one with more lanes, splitting lanes, permuting lanes, etc.).
>

Nice! Thanks for the information! This improvement sounds big and
promising! If we can discover SLP opportunities originating from loads,
this case isn't a big deal then.

> Currently I'm working on doing exactly this but only for SLP loads
> (because that's where it's most difficult...).

It looks like this case can be an input for your work, since the
isomorphic computations are easy to detect from loads.

>> A-way requires some additional vector permutations. However, I thought
>> if the existing scheme can't get any SLP chances, it looks reasonable to
>> extend it to consider this A-way grouping. Does it make sense?
>>
>> Another question is that even if we can go with A-way grouping, it can
>> only handle packing one byte (four iterations -> 4), what place is
>> reasonable to extend it to pack more? How about scanning all leaf
>> nodes and considering/transforming them together? Too hacky?
>
> Well, not sure - in the end it will all be heuristics since otherwise
> the exploration space is too big. But surely concentrating on
> load/store groups is good.
>

Totally agreed. This hacky idea originated from the existing code; if
SLP discovery improves, I think it's useless then.

> The SLP discovery code is already quite complicated btw., I'd
> hate to add "unstructured" hacks on top of it right now without
> future design goals in mind.

OK. Looking forward to its landing!

BR,
Kewen
Re: How to extend SLP to support this case
Hi Tamar,

on 2020/3/10 7:31 PM, Tamar Christina wrote:
>
>> -----Original Message-----
>> From: Gcc On Behalf Of Richard Biener
>> Sent: Tuesday, March 10, 2020 11:12 AM
>> To: Kewen.Lin
>> Cc: GCC Development; Segher Boessenkool
>> Subject: Re: How to extend SLP to support this case
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
>>>
>>> Hi all,
>>>
>>> But how to teach it to be aware of this? Currently the processing
>>> starts from bottom to up (from stores), can we do some analysis on the
>>> SLP instance, detect some pattern and update the whole instance?
>>
>> In theory yes (Tamar had something like that for AARCH64 complex rotations
>> IIRC). And yes, the issue boils down to how we handle SLP discovery. I'd like
>> to improve SLP discovery but it's on my list only after I managed to get rid of
>> the non-SLP code paths. I have played with some ideas (even produced
>> hackish patches) to find "seeds" to form SLP groups from using multi-level
>> hashing of stmts.
>
> I still have this but missed the stage-1 deadline after doing the rewriting
> to C++ 😊
>
> We've also been looking at this and the approach I'm investigating now is
> trying to get the SLP codepath to handle this after it's been fully
> unrolled. I'm looking into whether the build-slp can be improved to work
> for the group size == 16 case that it tries but fails on.
>

Thanks! Glad to know you have been working on this! Yes, I saw the
standalone SLP pass split the group (16 store stmts) finally.

> My intention is to see if doing so would make it simpler to recognize this
> as just 4 linear loads and two permutes. I think the loop aware SLP will
> have a much harder time with this seeing the load permutations it thinks
> it needs because of the permutes caused by the +/- pattern.

I may be missing something; just to double confirm, do you mean making it
4 linear loads for either of p1/p2? Since in the optimal vectorized
version, p1 and p2 each have 4 separate loads and a construction, then
further permutations.

> One idea I had before was from your comment on the complex number patch,
> which is to try and move up TWO_OPERATORS and undo the permute always when
> doing +/-. This would simplify the load permute handling and if a target
> doesn't have an instruction to support this it would just fall back to
> doing an explicit permute after the loads. But I wasn't sure this
> approach would get me the results I wanted.

IIUC, we have to seek for either ... or ..., since either can leverage
the isomorphic byte loads, subtraction, shift and addition. I was
thinking that the SLP pattern matcher could detect the pattern with two
levels of TWO_OPERATORS, one level with t/0,1,2,3/, the other with
a/0,1,2,3/, as well as the dependent isomorphic computations for
a/0,1,2,3/, and transform it into isomorphic subtraction, integer
promotion, shift and addition.

> In the end you don't want a loop here at all. And in order to do the above
> with TWO_OPERATORS I would have to let the SLP pattern matcher be able to
> reduce the group size and increase the no# iterations during the matching,
> otherwise the matching itself becomes quite difficult in certain cases.
>

OK, it sounds like that is unable to get the optimal one, which requires
all 16 bytes (0-3 or 4-7 x 4 iterations).

BR,
Kewen
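[To make the "4 linear loads and two permutes" shape concrete, here is a
hand-written sketch using GCC generic vectors; it is illustrative only
(the helper name and lane layout are mine, not what build-slp would
produce). Each of the four strided rows contributes one contiguous
8-byte load, and two shuffles then separate the p[0..3] bytes from the
p[4..7] bytes that feed the << 16:

  /* Sketch: gather the four strided 8-byte rows with plain loads, then
     use two permutes to split the "low half" bytes (p[0..3]) from the
     "high half" bytes (p[4..7]).  Illustrative only.  */
  typedef unsigned char v16qi __attribute__ ((vector_size (16)));

  static void
  load_and_split (const unsigned char *p, int stride, v16qi *lo, v16qi *hi)
  {
    unsigned char rows[32];
    for (int i = 0; i < 4; i++)                 /* 4 linear 8-byte loads */
      __builtin_memcpy (rows + 8 * i, p + (long) i * stride, 8);

    v16qi a, b;
    __builtin_memcpy (&a, rows, 16);            /* rows 0 and 1 */
    __builtin_memcpy (&b, rows + 16, 16);       /* rows 2 and 3 */

    /* Two permutes: pick bytes 0..3 of every row, then bytes 4..7.  */
    const v16qi lo_sel = { 0, 1, 2, 3, 8, 9, 10, 11,
                           16, 17, 18, 19, 24, 25, 26, 27 };
    const v16qi hi_sel = { 4, 5, 6, 7, 12, 13, 14, 15,
                           20, 21, 22, 23, 28, 29, 30, 31 };
    *lo = __builtin_shuffle (a, b, lo_sel);
    *hi = __builtin_shuffle (a, b, hi_sel);
  }
]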
Re: How to extend SLP to support this case
Hi Richi,

on 2020/3/10 7:14 PM, Richard Biener wrote:
> On Tue, Mar 10, 2020 at 12:12 PM Richard Biener wrote:
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin wrote:
>>>
>>> Hi all,
>>>
>>> I'm investigating whether GCC can vectorize the below case on ppc64le.
>>>
>>>   extern void test(unsigned int t[4][4]);
>>>
>>>   void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
>>>   {
>>>     unsigned int tmp[4][4];
>>>     unsigned int a0, a1, a2, a3;
>>>
>>>     for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>>>       a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
>>>       a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
>>>       a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
>>>       a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);
>>>
>>>       int t0 = a0 + a1;
>>>       int t1 = a0 - a1;
>>>       int t2 = a2 + a3;
>>>       int t3 = a2 - a3;
>>>
>>>       tmp[i][0] = t0 + t2;
>>>       tmp[i][2] = t0 - t2;
>>>       tmp[i][1] = t1 + t3;
>>>       tmp[i][3] = t1 - t3;
>>>     }
>>>     test(tmp);
>>>   }
>>> ...
>>> From the above, the key thing is to group tmp[i][j] i=/0,1,2,3/
>>> together, eg:
>>>   tmp[i][0] i=/0,1,2,3/ (one group)
>>>   tmp[i][1] i=/0,1,2,3/ (one group)
>>>   tmp[i][2] i=/0,1,2,3/ (one group)
>>>   tmp[i][3] i=/0,1,2,3/ (one group)
>>>
>>> in which the tmp[i][j] groups have the same isomorphic computations.
>>> But currently SLP is unable to divide groups this way (call it A-way
>>> for now).
>>>
>>> It's understandable since it has better adjacent store groups like:
>>>   tmp[0][i] i=/0,1,2,3/ (one group)
>>>   tmp[1][i] i=/0,1,2,3/ (one group)
>>>   tmp[2][i] i=/0,1,2,3/ (one group)
>>>   tmp[3][i] i=/0,1,2,3/ (one group)
>
> Note this is how the non-SLP path will (try to) vectorize the loop.
>

Oops, sorry for the confusion caused by my poor writing; it was intended
to show how the current SLP groups those 16 stores tmp[i][j]
i,j=/0,1,2,3/ when completely unrolled. I saw it split the 16 stmts into
4 groups this way finally.

BR,
Kewen
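[As an aside, one way to picture the A-way grouping is a hand-transposed
variant of foo; this is illustrative only (foo_transposed is hypothetical,
not a transform anyone proposed). After full unrolling, each desired
group tmp[i][j] i=/0,1,2,3/ becomes four adjacent stores, so the existing
bottom-up discovery would find exactly those groups, at the cost of the
transpose (the "additional vector permutations" mentioned earlier):

  extern void test(unsigned int t[4][4]);

  /* tmpT[j][i] plays the role of tmp[i][j]; each A-way group is now
     four adjacent stores tmpT[j][0..3].  */
  void foo_transposed(unsigned char *p1, int i1, unsigned char *p2, int i2)
  {
    unsigned int tmpT[4][4], tmp[4][4];

    for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
      unsigned int a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
      unsigned int a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
      unsigned int a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
      unsigned int a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);

      int t0 = a0 + a1, t1 = a0 - a1;
      int t2 = a2 + a3, t3 = a2 - a3;

      tmpT[0][i] = t0 + t2;
      tmpT[2][i] = t0 - t2;
      tmpT[1][i] = t1 + t3;
      tmpT[3][i] = t1 - t3;
    }

    /* Undo the transpose before the call; vectorized, this is where the
       extra permutations land.  */
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
        tmp[i][j] = tmpT[j][i];
    test(tmp);
  }
]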
[RFC, doloop] How to deal with invariants introduced by doloop
Hi all,

This is one question originating from PR61837, which shows that the
doloop pass could introduce some loop invariants against its outer loop
when doloop performs and prepares the iteration count for the hardware
count register assignment.

Even with Xionghu's simplification patch
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543887.html,
we still have one zero extension un-hoisted, and possibly much more if
we don't have any chances to simplify it.

The current related pass pipeline is:
      NEXT_PASS (pass_rtl_loop_init);
      NEXT_PASS (pass_rtl_move_loop_invariants);
      NEXT_PASS (pass_rtl_unroll_loops);
      NEXT_PASS (pass_rtl_doloop);
      NEXT_PASS (pass_rtl_loop_done);

* How about tweaking the pass order?

1. Move pass_rtl_move_loop_invariants after pass_rtl_doloop?

   I don't know the historical reason for this order, but one no-go
   reason may be that it can impact the following unrolling, which is
   sensitive to #insns: hoisting can probably reduce the loop body
   #insns, so it would be easy to break existing things without it
   running there first?

2. Move pass_rtl_doloop before pass_rtl_move_loop_invariants?

   It looks impractical; pass_rtl_unroll_loops can update the loop niter
   which is required by the doloop preparation, and predicting the value
   correctly sounds impossible.

* Rerun the hoisting pass?

3. Run pass_rtl_move_loop_invariants again after pass_rtl_doloop?

   It looks practical? But currently on Power it can't hoist all the
   invariants. That's caused by the pseudo used for the count register,
   which isn't taken as "invariant" by pass_rtl_move_loop_invariants
   since we still leave the doloop end pattern inside the loop, for
   example:

     67: r143:SI=r119:DI#0-0x1
     68: r142:DI=zero_extend(r143:SI)
     69: r141:DI=r142:DI+0x1                  // count
     ...
     66: {pc={(r141:DI!=0x1)?L64:pc};r141:DI=r141:DI-0x1;
          clobber scratch;clobber scratch;}   // doloop_end

   though the doloop_end will be eliminated on Power eventually.

I noticed that there is one hook, doloop_begin, which looks to help ports
introduce the loop count pseudo. By supporting this for rs6000, I can
obtain the insns I expected on Power.

Original:
   12: NOTE_INSN_BASIC_BLOCK 4
   67: %9:SI=%5:SI-0x1              // invariant
   14: %8:DI=%4:DI-0x1
   68: %9:DI=zero_extend(%9:SI)     // invariant
   15: %10:DI=%3:DI
   69: %9:DI=%9:DI+0x1              // invariant
   82: ctr:DI=%9:DI
       REG_DEAD %9:DI

With rerun hoisting:
   12: NOTE_INSN_BASIC_BLOCK 4
   69: %8:DI=%5:DI+0x1              // invariant
   14: %10:DI=%4:DI-0x1
   84: ctr:DI=%8:DI
       REG_DEAD %8:DI
   15: %9:DI=%3:DI

With rerun hoisting + doloop_begin:
   12: NOTE_INSN_BASIC_BLOCK 4
   70: ctr:DI=%5:DI
   14: %10:DI=%4:DI-0x1
   15: %9:DI=%3:DI

But I still have the concern that the current hoisting pass may not
consider register pressure; if the niter related invariants have some
derived uses inside the loop, this probably creates more register
pressure here.

Attached is the initial patch for this naive proposal; I'd like to post
it first to see whether it's reasonable to proceed. If it is, one thing
to be improved can be to guard it for only the related outer loops on
which doloop performed successfully, to save compilation time.

Welcome any comments!

BR,
Kewen

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 11ab74589b4..07340455348 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -12658,6 +12658,20 @@
   return "ori %0,%0,0";
 })
 
+;; Define doloop_begin to introduce one pseudo for count register.
+(define_expand "doloop_begin"
+  [(use (match_operand 0))   ; loop pseudo
+   (use (match_operand 1))]  ; doloop_end seq
+  ""
+{
+  rtx ctr_pseudo = gen_reg_rtx (Pmode);
+  rtx_insn *end_seq = (rtx_insn *) operands[1];
+  validate_replace_rtx (operands[0], ctr_pseudo, end_seq);
+  emit_move_insn (ctr_pseudo, operands[0]);
+  DONE;
+})
+
 ;; Define the subtract-one-and-jump insns, starting with the template
 ;; so loop.c knows what to generate.
diff --git a/gcc/loop-init.c b/gcc/loop-init.c
index 401e5282907..0cbb8e0bd32 100644
--- a/gcc/loop-init.c
+++ b/gcc/loop-init.c
@@ -516,6 +516,7 @@ public:
   {}
 
   /* opt_pass methods: */
+  opt_pass * clone () { return new pass_rtl_move_loop_invariants (m_ctxt); }
   virtual bool gate (function *) { return flag_move_loop_invariants; }
   virtual unsigned int execute (function *fun)
     {
diff --git a/gcc/passes.def b/gcc/passes.def
index 2bf2cb78fc5..4d8a63f2ae3 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -429,6 +429,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_rtl_move_loop_invariants);
       NEXT_PASS (pass_rtl_unroll_loops);
       NEXT_PASS (pass_rtl_
Re: [RFC, doloop] How to deal with invariants introduced by doloop
on 2020/4/17 7:32 PM, Richard Biener wrote:
> On Fri, Apr 17, 2020 at 1:10 PM Kewen.Lin via Gcc wrote:
>>
>> Hi all,
>>
>> This is one question originating from PR61837, which shows that the
>> doloop pass could introduce some loop invariants against its outer
>> loop when doloop performs and prepares the iteration count for the
>> hardware count register assignment.
>>
> I suggest to try both approaches and count the number of transforms
> done in each instance.
>
> Richard.
>

Hi Richi,

Thanks for the suggestion; some statistics were collected as below.

  A: default
  B: move pass_rtl_move_loop_invariants after pass_rtl_doloop
  C: rerun pass_rtl_move_loop_invariants after pass_rtl_doloop
  D: C + doloop_begin

Run by bootstrapping and regression testing on ppc64le Power8,
configured with languages c,c++,fortran,objc,obj-c++,go.

Counting the move #transformations in function move_invariant_reg
(before label fail; probably better with inv == repr to filter out
those replacements with repr, but I think the trend is similar?):

  A: 802650
  B: 841476
  C: 803485 (C1) + 827883 (C2)
  D: 802622 (D1) + 841476 (D2)

Let's call pass_rtl_move_loop_invariants "hoisting".
PS: C1/D1 are for the 1st hoisting run, while C2/D2 are for the 2nd.

The small differences (~0.1%) among A/C1/D1 should be caused by noise.
The numbers with two runs (C/D) are almost twice those of a single run,
which surprised me. By further investigation, it looks like the current
pass_rtl_move_loop_invariants has something to be improved if we want to
rerun it. Take gcc/testsuite/gfortran.dg/inline_matmul_16.f90 at -O1 as
an example: C1 does 178 transforms and C2 does 165; this is unrelated to
the unroll/doloop passes, as the result isn't changed by disabling them
explicitly.

Currently, without flag_ira_loop_pressure, the regs_used estimation
isn't good; I'd expect that invariants which are hoisted out of the loop
the first time should be counted as regs_used in the next run's
regs_used analysis. By checking the regs_used, it's set to 2 for all
loops of case inline_matmul_16, for either C1 or C2. I think this leads
the 2nd hoisting to optimistically estimate register pressure and hoist
more. With a simple hack to consider the new_regs from the 1st hoisting,
I can see the 2nd hoisting has fewer moves (57). It means the above
statistics comparison is unfair and unreliable.

With flag_ira_loop_pressure, the #transforms become 255 (1st) and
68 (2nd); it looks better but might also need more enhancements?

Since rs6000 sets flag_ira_loop_pressure at O3, I did a SPEC2017
performance evaluation on Power8 (against baseline A) with options
-Ofast -funroll-loops:
  * B showed 525.x264_r +1.43%, 538.imagick_r +1.23% speedups
    but a 503.bwaves_r -2.74% degradation.
  * C showed 500.perlbench_r -1.31%, 520.omnetpp_r -2.20% degradations.

The evaluation shows running hoisting after doloop can give us some
benefits, but rerunning it twice isn't able to give us similar gains.
It looks like, regardless of flag_ira_loop_pressure, rerunning the pass
requires more tweaks, probably considering those related parameters.
If we go with B, we need to figure out what we miss for bwaves_r.

BR,
Kewen
Re: [RFC, doloop] How to deal with invariants introduced by doloop
on 2020/4/21 5:43 PM, Richard Biener wrote:
> On Tue, Apr 21, 2020 at 10:42 AM Kewen.Lin wrote:
>>
>> on 2020/4/17 7:32 PM, Richard Biener wrote:
>>> On Fri, Apr 17, 2020 at 1:10 PM Kewen.Lin via Gcc wrote:
>>>>
>>>> Hi all,
>>>>
>>>> This is one question originating from PR61837, which shows that the
>>>> doloop pass could introduce some loop invariants against its outer
>>>> loop when doloop performs and prepares the iteration count for the
>>>> hardware count register assignment.
>>>>
>>> I suggest to try both approaches and count the number of transforms
>>> done in each instance.
>>>
>>
>> Currently, without flag_ira_loop_pressure, the regs_used estimation
>> isn't good; I'd expect that invariants which are hoisted out of the
>> loop the first time should be counted as regs_used in the next run's
>> regs_used analysis. By checking the regs_used, it's set to 2 for all
>> loops of case inline_matmul_16, for either C1 or C2. I think this
>> leads the 2nd hoisting to optimistically estimate register pressure
>> and hoist more. With a simple hack to consider the new_regs from the
>> 1st hoisting, I can see the 2nd hoisting has fewer moves (57). It
>> means the above statistics comparison is unfair and unreliable.
>
> OK, so that alone argues against doing C or D without better understanding
> and fixing this. That is, when you invoke invariant motion twice at its
> current place the second invocation shouldn't really do any more hoisting,
> definitely not a significant amount.
>

Exactly, I was expecting this before doing the data collection, so the
result surprised me. It looks more feasible to go with B from this
perspective.

>> With flag_ira_loop_pressure, the #transforms become 255 (1st) and
>> 68 (2nd); it looks better but might also need more enhancements?
>>
>> Since rs6000 sets flag_ira_loop_pressure at O3, I did a SPEC2017
>> performance evaluation on Power8 (against baseline A) with options
>> -Ofast -funroll-loops:
>>   * B showed 525.x264_r +1.43%, 538.imagick_r +1.23% speedups
>>     but a 503.bwaves_r -2.74% degradation.
>>   * C showed 500.perlbench_r -1.31%, 520.omnetpp_r -2.20% degradations.
>>
>> The evaluation shows running hoisting after doloop can give us some
>> benefits, but rerunning it twice isn't able to give us similar gains.
>> It looks like, regardless of flag_ira_loop_pressure, rerunning the
>> pass requires more tweaks, probably considering those related
>> parameters. If we go with B, we need to figure out what we miss for
>> bwaves_r.
>
> Of course it also requires benchmarking on other archs.
>

Yes, I'll check with -O2, which doesn't have flag_ira_loop_pressure,
w/ and w/o unrolling additionally.

BR,
Kewen
Question on tree LIM
Hi,

I am investigating one degradation related to SPEC2017 exchange2_r:
with loop vectorization on at -O2, it degraded by 6%. By some isolation,
I found it isn't directly caused by vectorization itself, but exposed by
vectorization: some statements for the vectorization condition checks
are hoisted out and they increase the register pressure, finally
resulting in more spilling than before. If I simply disable tree lim4,
I can see the gap become smaller (just 40%+ of the original); if I
further disable RTL lim, it becomes just 30% of the original. It seems
to indicate there is some room to improve in both LIMs.

By quick scanning in tree LIM, I noticed that there seem to be no
considerations of register pressure; it looks intentional? I am
wondering what's the design philosophy behind it? Is it because it's
hard to model register pressure well here? If so, it seems to put the
burden onto late RA, which needs to have good rematerialization support.

btw, the example loop is at line 1150 from src exchange2.fppized.f90

  1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10

The extra hoisted statements after the vectorization on this loop
(cheap cost model btw) are:

    _686 = (integer(kind=8)) rnext_679;
    _ = (sizetype) _19;
    _1112 = _ * 12;
    _1927 = _1112 + 12;
  * _1895 = _1927 - _2650;
    _1113 = (unsigned long) rnext_679;
  * niters.6220_1128 = 10 - _1113;
  * _1021 = 9 - _1113;
  * bnd.6221_940 = niters.6220_1128 >> 2;
  * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
    _144 = niters_vector_mult_vf.6222_939 + _1113;
    tmp.6223_934 = (integer(kind=8)) _144;
    S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
  * ivtmp.6410_289 = (unsigned long) S.823_1004;

PS: * indicates the ones that have a long live interval.

BR,
Kewen
Re: Question on tree LIM
Hi Richard,

on 2021/7/2 4:07 PM, Richard Biener wrote:
> On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc wrote:
>>
>> Hi,
>>
>> I am investigating one degradation related to SPEC2017 exchange2_r:
>> with loop vectorization on at -O2, it degraded by 6%. By some
>> isolation, I found it isn't directly caused by vectorization itself,
>> but exposed by vectorization: some statements for the vectorization
>> condition checks are hoisted out and they increase the register
>> pressure, finally resulting in more spilling than before. If I simply
>> disable tree lim4, I can see the gap become smaller (just 40%+ of
>> the original); if I further disable RTL lim, it becomes just 30% of
>> the original. It seems to indicate there is some room to improve in
>> both LIMs.
>>
>> By quick scanning in tree LIM, I noticed that there seem to be no
>> considerations of register pressure; it looks intentional? I am
>> wondering what's the design philosophy behind it? Is it because it's
>> hard to model register pressure well here? If so, it seems to put the
>> burden onto late RA, which needs to have good rematerialization
>> support.
>
> Yes, it is "intentional" in that doing any kind of prioritization based
> on register pressure is hard on the GIMPLE level since most
> high-level transforms try to expose followup transforms which you'd
> somehow have to anticipate. Note that LIM's "cost model" (if you can
> call it such...) is too simplistic to be a good base to decide which
> 10 of the 20 candidates you want to move (and I've repeatedly pondered
> to remove it completely).
>

Thanks for the explanation! Do you really want to remove it completely
rather than just improve it with a better one? :-\

Here there are some PRs (PR96825, PR98782) related to exchange2_r which
seem to suffer from high register pressure and bad spilling. Not sure
whether they are also somehow related to the pressure given from LIM,
but the trigger is commit 1118a3ff9d3ad6a64bba25dc01e7703325e23d92,
which adjusts prediction frequency; maybe it's worth re-visiting this
idea about considering BB frequency in the LIM cost model:
https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

> As to putting the burden on RA - yes, that's one possibility. The other
> possibility is to use the register-pressure aware scheduler, though not
> sure if that will ever move things into loop bodies.
>

Brand new idea! IIUC it requires a global scheduler, and I'm not sure
how well GCC's global scheduler performs. Generally speaking, a
register-pressure aware scheduler will prefer the insn which has more
deaths (for the intensive regclass); for this problem the modeling seems
a bit different, as it has to care about the total interference numbers
between two "equivalent" blocks (src/dest). Not sure if it's easier to
do than rematerialization.

>> btw, the example loop is at line 1150 from src exchange2.fppized.f90
>>
>>   1150 block(rnext:9, 7, i7) = block(rnext:9, 7, i7) + 10
>>
>> The extra hoisted statements after the vectorization on this loop
>> (cheap cost model btw) are:
>>
>>     _686 = (integer(kind=8)) rnext_679;
>>     _ = (sizetype) _19;
>>     _1112 = _ * 12;
>>     _1927 = _1112 + 12;
>>   * _1895 = _1927 - _2650;
>>     _1113 = (unsigned long) rnext_679;
>>   * niters.6220_1128 = 10 - _1113;
>>   * _1021 = 9 - _1113;
>>   * bnd.6221_940 = niters.6220_1128 >> 2;
>>   * niters_vector_mult_vf.6222_939 = niters.6220_1128 & 18446744073709551612;
>>     _144 = niters_vector_mult_vf.6222_939 + _1113;
>>     tmp.6223_934 = (integer(kind=8)) _144;
>>     S.823_1004 = _1021 <= 2 ? _686 : tmp.6223_934;
>>   * ivtmp.6410_289 = (unsigned long) S.823_1004;
>>
>> PS: * indicates the ones that have a long live interval.
>
> Note for the vectorizer generated conditions there's quite some room for
> improvements to reduce the amount of semi-redundant computations. I've
> pointed out some to Andre, in particular suggesting to maintain a single
> "remaining scalar iterations" IV across all the checks to avoid keeping
> 'niters' live and doing all the above masking & shifting repeatedly before
> the prologue/main/vectorized epilogue/epilogue loops. Not sure how far
> he got with that idea.
>

Great, it definitely helps to mitigate this problem. Thanks for the
information.

BR,
Kewen
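[A sketch of my reading of that "remaining scalar iterations" suggestion;
illustrative only, not Andre's actual work. The idea is to carry one
counter across the prologue/main/epilogue checks, so that niters itself
need not stay live and be re-masked/re-shifted before each loop version:

  #define VF 4  /* hypothetical vectorization factor */

  void
  process (int *a, unsigned long niters)
  {
    unsigned long remaining = niters;

    /* Main vectorized loop: consume VF iterations at a time.  */
    while (remaining >= VF)
      {
        for (int j = 0; j < VF; j++)   /* stands in for one vector op */
          a[niters - remaining + j] += 1;
        remaining -= VF;
      }

    /* Scalar epilogue: the guard tests 'remaining' directly instead of
       recomputing 'niters & (VF - 1)' and friends.  */
    while (remaining)
      {
        a[niters - remaining] += 1;
        remaining--;
      }
  }
]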
Re: Question on tree LIM
on 2021/7/2 7:28 PM, Richard Biener wrote:
> On Fri, Jul 2, 2021 at 11:05 AM Kewen.Lin wrote:
>>
>> Hi Richard,
>>
>> on 2021/7/2 4:07 PM, Richard Biener wrote:
>>> On Fri, Jul 2, 2021 at 5:34 AM Kewen.Lin via Gcc wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am investigating one degradation related to SPEC2017 exchange2_r:
>>>> with loop vectorization on at -O2, it degraded by 6%. By some
>>>> isolation, I found it isn't directly caused by vectorization itself,
>>>> but exposed by vectorization: some statements for the vectorization
>>>> condition checks are hoisted out and they increase the register
>>>> pressure, finally resulting in more spilling than before. If I
>>>> simply disable tree lim4, I can see the gap become smaller (just
>>>> 40%+ of the original); if I further disable RTL lim, it becomes
>>>> just 30% of the original. It seems to indicate there is some room
>>>> to improve in both LIMs.
>>>>
>>>> By quick scanning in tree LIM, I noticed that there seem to be no
>>>> considerations of register pressure; it looks intentional? I am
>>>> wondering what's the design philosophy behind it? Is it because
>>>> it's hard to model register pressure well here? If so, it seems to
>>>> put the burden onto late RA, which needs to have good
>>>> rematerialization support.
>>>
>>> Yes, it is "intentional" in that doing any kind of prioritization based
>>> on register pressure is hard on the GIMPLE level since most
>>> high-level transforms try to expose followup transforms which you'd
>>> somehow have to anticipate. Note that LIM's "cost model" (if you can
>>> call it such...) is too simplistic to be a good base to decide which
>>> 10 of the 20 candidates you want to move (and I've repeatedly pondered
>>> to remove it completely).
>>>
>>
>> Thanks for the explanation! Do you really want to remove it completely
>> rather than just improve it with a better one? :-\
>
> ;) For example the LIM cost model makes it not hoist an invariant (int)x
> but then PRE, which detects invariant motion opportunities as partial
> redundancies, happily does (because PRE has no cost model at all - heh).
>

Got it, thanks for the further clarification. :)

>> Here there are some PRs (PR96825, PR98782) related to exchange2_r which
>> seem to suffer from high register pressure and bad spilling. Not sure
>> whether they are also somehow related to the pressure given from LIM,
>> but the trigger is commit 1118a3ff9d3ad6a64bba25dc01e7703325e23d92,
>> which adjusts prediction frequency; maybe it's worth re-visiting this
>> idea about considering BB frequency in the LIM cost model:
>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Note most "problems", and those which are harder to undo, stem from
> LIM's store-motion which increases register pressure inside loops by
> adding loop-carried dependences. The BB frequency might be a way
> to order candidates when we have a way to set a better cap on the
> number of refs to move. Note the current "cost" model is rather a
> benefit model and causes us to not move cheap things (like the above
> conversion) because it seems not worth the trouble.
>

Yeah, I noticed it at least excludes "cheap" ones.

> Note a very simple way would be to have a --param specifying a
> maximum number of refs to move (but note there are several
> LIM/store-motion passes so any such static limit would have
> surprising effects). For store-motion I considered a hard limit on
> the number of loop carried dependences (PHIs) and counting both
> existing and added ones (to avoid the surprise).
>
> Note how such limits or other cost models should consider inner and
> outer loop behavior remains to be determined - at least LIM works
> at the level of whole loop nests and there's a rough idea of dependent
> transforms but simply gathering candidates and stripping some isn't
> going to work without major surgery in that area I think.
>

Thanks for all the notes and thoughts! I'd probably better visit RA
rematerialization first; Xionghu has some interest in investigating how
to consider BB frequency in LIM, so I will check its effect and further
check these ideas if needed then.

BR,
Kewen
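[To illustrate Richi's (int)x remark above, a small example of mine (not
from the thread) where the conversion is loop-invariant; per the
discussion, LIM's cost model deems such a conversion too cheap to move,
while PRE hoists it as a partial redundancy:

  long
  f (int *a, long x, int n)
  {
    long s = 0;
    for (int i = 0; i < n; i++)
      s += a[i] * (int) x;   /* (int) x is invariant in the loop */
    return s;
  }
]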
Re: Enable the vectorizer at -O2 for GCC 12
on 2021/8/30 10:11 PM, Bill Schmidt wrote:
> On 8/30/21 8:04 AM, Florian Weimer wrote:
>> There has been a discussion, both off-list and on the gcc-help mailing
>> list (“Why vectorization didn't turn on by -O2”, spread across several
>> months), about enabling the auto-vectorizer at -O2, similar to what
>> Clang does.
>>
>> I think the review concluded that the very cheap cost model should be
>> used for that.
>>
>> Are there any remaining blockers?
>
> Hi Florian,
>
> I don't think I'd characterize it as having blockers, but we are
> continuing to investigate small performance issues that arise with
> very-cheap, including some things that regressed in GCC 12. Kewen Lin
> is leading that effort. Kewen, do you feel we have any major remaining
> concerns with this plan?
>

Hi Florian & Bill,

There are some small performance issues like PR101944 and PR102054, and
still two degraded benchmarks (P9 520.omnetpp_r -2.41% and P8
526.blender_r -1.31%) to be investigated/clarified, but since their
performance numbers with separated loop and SLP vectorization options
look neutral, they are very likely noise. IMHO I don't think they
are/will be blockers. So I think it's good to turn this on by default
for Power.

BR,
Kewen
Re: Kewen Lin as PowerPC port co-maintainer
on 2022/5/5 03:20, David Edelsohn wrote:
> I am pleased to announce that the GCC Steering Committee has
> appointed Kewen Lin as GCC PowerPC port Co-Maintainer.
>

Thanks a lot! It's a great honor for me!

> Please join me in congratulating Kewen on his new role.
>
> Kewen, please update your listing in the MAINTAINERS file.
>

Updated MAINTAINERS file in r13-127.

BR,
Kewen

> Happy hacking!
> David
RFC: Make builtin types only valid for some target features
Hi,

I'm working to find one solution for PR106736, which requires us to make
some built-in types only valid for some target features, and emit error
messages for the types when the condition isn't satisfied. A
straightforward idea is to guard the registration of a built-in type
under the corresponding target feature. But as Peter pointed out in the
PR, it doesn't work, as these built-in types are used by some built-in
functions which are required to be initialized always, regardless of
target features, in order to support target pragma/attribute. For the
validity checking on the built-in functions, it happens during expansion
of the built-in function calls, since by then we already know the exact
information on the specific target features.

One idea is to support built-in type checking in a similar way, to check
somewhere whether all used type_decls (built-in types) are valid. I
hacked to simply check whether the currently expanding gimple stmt is a
gassign and further check the types of its operands; it did work, but
checking gassign isn't enough. Maybe I missed something, but there seems
to be no efficient way for a full check IMHO.

So I tried another direction, which was inspired by the existing
attribute altivec, to introduce an artificial type attribute and the
corresponding macro definition; during attribute handling it can check
the target option node for the target feature for validity. The
advantage is that the checking happens in the FE, so it reports errors
early, and it doesn't need a later full checking on types. But with some
prototyping work, I found one issue: it doesn't support param decls
well, since the handling of attributes on a function decl happens after
that of attributes on its param decls, so we aren't able to get the
exact target feature information when handling the attributes on a param
decl. It would require the front end to change the parsing order; I
guess that's not acceptable? So I planned to give up and return to the
previous direction.

Does the former idea sound good? Any comments/suggestions, and other
ideas?

Thanks a lot in advance!

BR,
Kewen
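[As a purely hypothetical sketch of the attribute direction (the handler
name and diagnostics are mine, not actual rs6000 code): the handler of
the artificial type attribute would consult the target options in effect
and reject the type when the feature is disabled, something like:

  /* Sketch: a type attribute handler that checks the MMA feature.  */
  static tree
  rs6000_handle_mma_type_attribute (tree *node, tree ARG_UNUSED (name),
                                    tree ARG_UNUSED (args),
                                    int ARG_UNUSED (flags),
                                    bool *no_add_attrs)
  {
    if (!TARGET_MMA)
      {
        error ("type %qT requires the %qs option", *node, "-mmma");
        *no_add_attrs = true;
      }
    return NULL_TREE;
  }

The param-decl ordering problem described above would hit exactly here:
when this handler runs for a parameter's type, the function's own target
attribute has not been processed yet, so TARGET_MMA may not reflect the
final target options.]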
Re: RFC: Make builtin types only valid for some target features
Hi Richard,

on 2022/12/5 15:31, Richard Sandiford wrote:
> "Kewen.Lin" writes:
>> Hi,
>>
>> I'm working to find one solution for PR106736, which requires us to
>> make some built-in types only valid for some target features, and
>> emit error messages for the types when the condition isn't satisfied.
>> A straightforward idea is to guard the registration of a built-in
>> type under the corresponding target feature. But as Peter pointed out
>> in the PR, it doesn't work, as these built-in types are used by some
>> built-in functions which are required to be initialized always,
>> regardless of target features, in order to support target
>> pragma/attribute. For the validity checking on the built-in
>> functions, it happens during expansion of the built-in function
>> calls, since by then we already know the exact information on the
>> specific target features.
>>
>> One idea is to support built-in type checking in a similar way, to
>> check somewhere whether all used type_decls (built-in types) are
>> valid. I hacked to simply check whether the currently expanding
>> gimple stmt is a gassign and further check the types of its operands;
>> it did work, but checking gassign isn't enough. Maybe I missed
>> something, but there seems to be no efficient way for a full check
>> IMHO.
>>
>> So I tried another direction, which was inspired by the existing
>> attribute altivec, to introduce an artificial type attribute and the
>> corresponding macro definition; during attribute handling it can
>> check the target option node for the target feature for validity.
>> The advantage is that the checking happens in the FE, so it reports
>> errors early, and it doesn't need a later full checking on types.
>> But with some prototyping work, I found one issue: it doesn't support
>> param decls well, since the handling of attributes on a function decl
>> happens after that of attributes on its param decls, so we aren't
>> able to get the exact target feature information when handling the
>> attributes on a param decl. It would require the front end to change
>> the parsing order; I guess that's not acceptable? So I planned to
>> give up and return to the previous direction.
>>
>> Does the former idea sound good? Any comments/suggestions, and other
>> ideas?
>>
>> Thanks a lot in advance!
>
> FWIW, the aarch64 fp move patterns emit the error directly. They then
> expand an integer-mode move, to provide some error recovery. (The
> alternative would be to make the error fatal.)
>
> (define_expand "mov<mode>"
>   [(set (match_operand:GPF_TF_F16_MOV 0 "nonimmediate_operand")
>         (match_operand:GPF_TF_F16_MOV 1 "general_operand"))]
>   ""
> {
>   if (!TARGET_FLOAT)
>     {
>       aarch64_err_no_fpadvsimd (<MODE>mode);
>       machine_mode intmode
>         = int_mode_for_size (GET_MODE_BITSIZE (<MODE>mode), 0).require ();
>       emit_move_insn (gen_lowpart (intmode, operands[0]),
>                       gen_lowpart (intmode, operands[1]));
>       DONE;
>     }
>
> This isn't as user-friendly as catching the error directly in the FE,
> but I think in practice it's going to be very hard to trap all invalid
> uses of a type there. Also, the user error in these situations is likely
> to be forgetting to enable the right architecture feature, rather than
> accidentally using the wrong type. So an error about missing architecture
> features is probably good enough in most cases.
>

Thanks a lot for your comments! Yes, it's a question whether it's worth
spending non-trivial effort to check and report all invalid uses. For
the case in PR106736, it's enough to check in the mov[ox]o
define_expands like the example you provided above; one trial patch was
posted previously at [1]. I think one difference is that we want to
further check the actual type instead of the mode, as Peter said
"'types' are limited to MMA, but the opaque modes are not limited to
MMA.".

[1] https://gcc.gnu.org/bugzilla/attachment.cgi?id=53522&action=diff

BR,
Kewen
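[For reference, a minimal sketch of that direction; assumptions: this
sits in the body of the rs6000 movoo (and analogously movxo)
define_expand, and error recovery is kept minimal, unlike the aarch64
pattern above which emits an integer-mode move. This mirrors the rough
shape of the trial patch [1], not its exact contents:

  /* Sketch only, inside the movoo expander body.  */
  if (!TARGET_MMA)
    {
      /* Reporting in terms of __vector_pair is an approximation: as
         noted above, OOmode is an opaque mode that need not come from
         the __vector_pair type, which is why checking the actual type
         would be preferable.  */
      error ("%qs requires the %qs option", "__vector_pair", "-mmma");
      DONE;
    }
]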
Re: RFC: Make builtin types only valid for some target features
Hi Andrew,

on 2022/12/5 18:10, Andrew Pinski wrote:
> On Sun, Dec 4, 2022 at 11:33 PM Richard Sandiford via Gcc wrote:
>>
>> "Kewen.Lin" writes:
>>> Hi,
>>>
>>> I'm working to find one solution for PR106736, which requires us to
>>> make some built-in types only valid for some target features, and
>>> emit error messages for the types when the condition isn't
>>> satisfied. A straightforward idea is to guard the registration of a
>>> built-in type under the corresponding target feature. But as Peter
>>> pointed out in the PR, it doesn't work, as these built-in types are
>>> used by some built-in functions which are required to be initialized
>>> always, regardless of target features, in order to support target
>>> pragma/attribute. For the validity checking on the built-in
>>> functions, it happens during expansion of the built-in function
>>> calls, since by then we already know the exact information on the
>>> specific target features.
>>>
>>> One idea is to support built-in type checking in a similar way, to
>>> check somewhere whether all used type_decls (built-in types) are
>>> valid. I hacked to simply check whether the currently expanding
>>> gimple stmt is a gassign and further check the types of its
>>> operands; it did work, but checking gassign isn't enough. Maybe I
>>> missed something, but there seems to be no efficient way for a full
>>> check IMHO.
>>>
>>> So I tried another direction, which was inspired by the existing
>>> attribute altivec, to introduce an artificial type attribute and the
>>> corresponding macro definition; during attribute handling it can
>>> check the target option node for the target feature for validity.
>>> The advantage is that the checking happens in the FE, so it reports
>>> errors early, and it doesn't need a later full checking on types.
>>> But with some prototyping work, I found one issue: it doesn't
>>> support param decls well, since the handling of attributes on a
>>> function decl happens after that of attributes on its param decls,
>>> so we aren't able to get the exact target feature information when
>>> handling the attributes on a param decl. It would require the front
>>> end to change the parsing order; I guess that's not acceptable? So I
>>> planned to give up and return to the previous direction.
>>>
>>> Does the former idea sound good? Any comments/suggestions, and other
>>> ideas?
>>>
>>> Thanks a lot in advance!
>>
>> FWIW, the aarch64 fp move patterns emit the error directly. They then
>> expand an integer-mode move, to provide some error recovery. (The
>> alternative would be to make the error fatal.)
>>
>> (define_expand "mov<mode>"
>>   [(set (match_operand:GPF_TF_F16_MOV 0 "nonimmediate_operand")
>>         (match_operand:GPF_TF_F16_MOV 1 "general_operand"))]
>>   ""
>> {
>>   if (!TARGET_FLOAT)
>>     {
>>       aarch64_err_no_fpadvsimd (<MODE>mode);
>>       machine_mode intmode
>>         = int_mode_for_size (GET_MODE_BITSIZE (<MODE>mode), 0).require ();
>>       emit_move_insn (gen_lowpart (intmode, operands[0]),
>>                       gen_lowpart (intmode, operands[1]));
>>       DONE;
>>     }
>>
>> This isn't as user-friendly as catching the error directly in the FE,
>> but I think in practice it's going to be very hard to trap all invalid
>> uses of a type there. Also, the user error in these situations is likely
>> to be forgetting to enable the right architecture feature, rather than
>> accidentally using the wrong type. So an error about missing architecture
>> features is probably good enough in most cases.
>
> I did have a patch which improved the situation for the SVE types to
> provide an error message at compile time when SVE is not enabled,
> but I didn't get any feedback from either the C or C++ front-end folks:
> https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583786.html
>

Nice! Many thanks for providing this new direction.

> I suspect if that patch gets reviewed by the front-end folks, Kewen
> could use the same infrastructure to error out on the types for the
> rs6000 backend.

Yeah, I just confirmed that, on top of your patch, introducing a
function rs6000_verify_type_context to take care of those MMA types can
fix the issue in PR106736. TBH, I'm not sure whether your patch can
cover all possible places where a built-in type can be used, but I
guess it covers most of them.

BR,
Kewen
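[For the record, a rough sketch of that direction, modeled on how aarch64
uses the TARGET_VERIFY_TYPE_CONTEXT hook for the SVE types; the
vector_pair_type_node/vector_quad_type_node names and the diagnostics
are assumptions of this sketch, not the committed fix:

  /* Sketch only: reject uses of the MMA built-in types in the contexts
     the front end asks about when MMA is not enabled.  */
  static bool
  rs6000_verify_type_context (location_t loc, type_context_kind,
                              const_tree type, bool silent_p)
  {
    /* Assumption: these are the global trees for __vector_pair and
       __vector_quad; everything else needs no feature check.  */
    if (type != vector_pair_type_node && type != vector_quad_type_node)
      return true;

    if (!TARGET_MMA)
      {
        if (!silent_p)
          error_at (loc, "type %qT requires the %qs option", type, "-mmma");
        return false;
      }
    return true;
  }

  #undef TARGET_VERIFY_TYPE_CONTEXT
  #define TARGET_VERIFY_TYPE_CONTEXT rs6000_verify_type_context
]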