On Tue, 2013-12-03 at 21:57 +0000, Yufeng Zhang wrote:
> On 12/03/13 20:35, Richard Biener wrote:
> > Yufeng Zhang <yufeng.zh...@arm.com> wrote:
> >> On 12/03/13 14:20, Richard Biener wrote:
> >>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang <yufeng.zh...@arm.com> wrote:
> >>>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>>
> >>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>>
> >>>>>> Ping~
> >>>>>>
> >>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Yufeng
> >>>>>>
> >>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>>
> >>>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>>
> >>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng Zhang
> >>>>>>>> <yufeng.zh...@arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>>
> >>>>>>>>>> The second version of your original patch is ok with me with
> >>>>>>>>>> the following changes.  Sorry for the little side adventure
> >>>>>>>>>> into the next-interp logic; in the end that's going to hurt
> >>>>>>>>>> more than it helps in this case.  Thanks for having a look at
> >>>>>>>>>> it, anyway.  Thanks also for cleaning up this version to be
> >>>>>>>>>> less intrusive to common interfaces; I appreciate it.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >>>>>>>>> with the suggested changes incorporated.
> >>>>>>>>>
> >>>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>>> experiment; it was a good chance to gain insight into the pass.
> >>>>>>>>> Many thanks for your prompt replies and patience in guiding me!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Everything else looks OK to me.  Please ask Richard for final
> >>>>>>>>>> approval,
> >>>>>>>>>> as I'm not a maintainer.
> >>>>>
> >>>>> First a note: I need to check on the voting for Bill as the slsr
> >>>>> maintainer from the steering committee.  Voting was in progress
> >>>>> just before the close of stage1 development, so I haven't tallied
> >>>>> the results yet. :-)
> >>>>
> >>>>
> >>>> Looking forward to some good news! :)
> >>>>
> >>>>
> >>>>>>>
> >>>>>>> Yes, you are right that non-trivial 'base' trees are rarely
> >>>>>>> shared.  The cache is introduced mainly because
> >>>>>>> get_alternative_base () may be called twice on the same 'base'
> >>>>>>> tree: once in find_basis_for_candidate () for the look-up, and
> >>>>>>> again in alloc_cand_and_find_basis () for
> >>>>>>> record_potential_basis ().  I'm happy to leave out the cache if
> >>>>>>> you think the benefit is trivial.
> >>>>>
> >>>>> Without some sense of how expensive the lookups are vs. how often
> >>>>> the cache hits, it's awfully hard to know if the cache is worth
> >>>>> it.
> >>>>>
> >>>>> I'd say take it out unless you have some sense it's really saving
> >>>>> time.  It's a pretty minor implementation detail either way.
> >>>>
> >>>>
> >>>> I think the affine tree routines are generally expensive; it is
> >>>> worth having a cache to avoid calling them too many times.  I ran
> >>>> the slsr-*.c tests under gcc.dg/tree-ssa/ and found that the cache
> >>>> hit rates range from 55.6% to 90%, with an average of 73.5%.  The
> >>>> samples may not represent real-world scenarios well, but they do
> >>>> show that the 'base' tree can be shared to some extent.  So I'd
> >>>> like to keep the cache in the patch.
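> >>>>
> >>>> To make the caching idea concrete, here is a minimal,
> >>>> self-contained C sketch: memoize an expensive per-tree computation
> >>>> keyed on pointer identity, so the two calls made for the same
> >>>> 'base' tree pay for the expansion only once.  The names
> >>>> (expand_base, get_alternative_base_cached, CACHE_SIZE) and the
> >>>> open-addressing table are illustrative only, not the patch's
> >>>> actual data structures.
> >>>>
> >>>>   #include <stddef.h>
> >>>>
> >>>>   #define CACHE_SIZE 251  /* small prime, open addressing */
> >>>>
> >>>>   struct cache_entry
> >>>>   {
> >>>>     const void *key;      /* the 'base' tree node */
> >>>>     void *value;          /* its cached alternative base */
> >>>>   };
> >>>>
> >>>>   static struct cache_entry cache[CACHE_SIZE];
> >>>>
> >>>>   /* Stand-in for the expensive affine expansion; a no-op here.  */
> >>>>   static void *
> >>>>   expand_base (const void *base)
> >>>>   {
> >>>>     return (void *) base;
> >>>>   }
> >>>>
> >>>>   void *
> >>>>   get_alternative_base_cached (const void *base)
> >>>>   {
> >>>>     size_t i = ((size_t) base / sizeof (void *)) % CACHE_SIZE;
> >>>>
> >>>>     /* Linear probing; a real implementation would handle a full
> >>>>        table and invalidation across passes.  */
> >>>>     while (cache[i].key != NULL && cache[i].key != base)
> >>>>       i = (i + 1) % CACHE_SIZE;
> >>>>
> >>>>     if (cache[i].key == base)
> >>>>       return cache[i].value;  /* hit */
> >>>>
> >>>>     cache[i].key = base;      /* miss: compute and remember */
> >>>>     cache[i].value = expand_base (base);
> >>>>     return cache[i].value;
> >>>>   }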
> >>>>
> >>>>
> >>>>>
> >>>>>>>
> >>>>>>>> +/* { dg-do compile } */
> >>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>>> +
> >>>>>>>> +typedef int arr_2[50][50];
> >>>>>>>> +
> >>>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>>> +{
> >>>>>>>> +  int i, j;
> >>>>>>>> +
> >>>>>>>> +  i = v1 + 5;
> >>>>>>>> +  j = i;
> >>>>>>>> +  a2 [i-10] [j] = 2;
> >>>>>>>> +  a2 [i] [j++] = i;
> >>>>>>>> +  a2 [i+20] [j++] = i;
> >>>>>>>> +  a2 [i-3] [i-1] += 1;
> >>>>>>>> +  return;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>>
> >>>>>>>> Scanning for 5 MEMs looks nonsensical.  What transform do you
> >>>>>>>> expect?  I see other slsr testcases doing similar nonsensical
> >>>>>>>> checking, which is bad, too.
> >>>>>>>
> >>>>>>>
> >>>>>>> As slsr optimizes CAND_REF candidates by simply lowering them
> >>>>>>> to MEM_REFs from e.g. ARRAY_REFs, I think scanning for the
> >>>>>>> number of MEM_REFs is an effective check.  Alternatively, I can
> >>>>>>> add a follow-up patch that adds a dumping facility to
> >>>>>>> replace_ref () to print out the replacement actions when
> >>>>>>> -fdump-tree-slsr-details is on.
> >>>>>
> >>>>> I think adding some details to the dump and scanning for them
> >>>>> would be better.  That's the only change required for this to
> >>>>> move forward.
> >>>>
> >>>>
> >>>> I've updated the patch to dump more details when
> >>>> -fdump-tree-slsr-details is on.  The tests have also been updated
> >>>> to scan for these new dump lines instead of MEMs.
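> >>>>
> >>>> For illustration, an updated scan might look something like the
> >>>> following (the exact string is whatever replace_ref () prints
> >>>> under -fdump-tree-slsr-details; "Replacing reference: " is an
> >>>> assumed wording here, not necessarily what the patch emits):
> >>>>
> >>>>   /* { dg-final { scan-tree-dump-times "Replacing reference: " 5 "slsr" } } */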
> >>>>
> >>>>
> >>>>>
> >>>>> I suggest doing it quickly.  We're well past stage1 close at
> >>>>> this point.
> >>>>
> >>>>
> >>>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >>>> succeeds?
> >>>
> >>> I still don't like it.  It's using the wrong, and too expensive,
> >>> tools to do the job.  What kind of bases are we ultimately
> >>> interested in?  Browsing the code, it looks like we have
> >>>
> >>>     /* Base expression for the chain of candidates:  often, but not
> >>>        always, an SSA name.  */
> >>>     tree base_expr;
> >>>
> >>> which isn't really too informative, but I suppose they are all
> >>> kind-of-gimple_val()s?  That said, I wonder if you can simply use
> >>> get_addr_base_and_unit_offset () in place of get_alternative_base
> >>> (), ignoring the returned offset.
> >>
> >> 'base_expr' is essentially the base address of a
> >> handled_component_p, e.g. an ARRAY_REF, COMPONENT_REF, etc.  In most
> >> cases, it is the address of the object returned by
> >> get_inner_reference ().
> >>
> >> Given a test case like the following:
> >>
> >> typedef int arr_2[20][20];
> >>
> >> void foo (arr_2 a2, int i, int j)
> >> {
> >>    a2[i+10][j] = 1;
> >>    a2[i+10][j+1] = 1;
> >>    a2[i+20][j] = 1;
> >> }
> >>
> >> The IR before SLSR is (on x86_64):
> >>
> >>    _2 = (long unsigned int) i_1(D);
> >>    _3 = _2 * 80;
> >>    _4 = _3 + 800;
> >>    _6 = a2_5(D) + _4;
> >>    *_6[j_8(D)] = 1;
> >>    _10 = j_8(D) + 1;
> >>    *_6[_10] = 1;
> >>    _12 = _3 + 1600;
> >>    _13 = a2_5(D) + _12;
> >>    *_13[j_8(D)] = 1;
> >>
> >> The base_expr for the 1st and 2nd memory references is the same,
> >> i.e. _6, while the base_expr for a2[i+20][j] is _13.
> >>
> >> _13 is essentially (_6 + 800), so all three memory references
> >> essentially share the same base address.  As their strides are also
> >> the same (MULT_EXPR (j, 4)), the three references can all be lowered
> >> to MEM_REFs.  What this patch does is use the tree affine tools to
> >> help recognize the underlying base address expression; since that
> >> requires looking through the definitions of SSA_NAMEs,
> >> get_addr_base_and_unit_offset () won't help here.
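> >>
> >> To make that concrete, expanding the two base SSA names with the
> >> affine machinery gives (with the constants folded by hand):
> >>
> >>   _6  = a2_5(D) + _3 + 800  = a2_5(D) + (sizetype) i_1(D) * 80 + 800
> >>   _13 = a2_5(D) + _3 + 1600 = a2_5(D) + (sizetype) i_1(D) * 80 + 1600
> >>
> >> Stripping the constant offsets leaves the common base
> >> (a2_5(D) + (sizetype) i_1(D) * 80).  A purely syntactic helper like
> >> get_addr_base_and_unit_offset () cannot discover this, as it does
> >> not follow the SSA definitions of _6 and _13.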
> >>
> >> Bill has helped me explore other ways of achieving this in SLSR,
> >> but so far we think this is the best way to proceed.  The use of the
> >> tree affine routines is restricted to CAND_REFs only, and there is
> >> the aforementioned cache facility to help reduce the overhead.
> >>
> >> Thanks,
> >> Yufeng
> >>
> >> P.S. Some more details on what the patch does:
> >>
> >> The CAND_REFs for the three memory references are:
> >>
> >>   6  [2] *_6[j_8(D)] = 1;
> >>       REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>       basis: 0  dependent: 8  sibling: 0
> >>       next-interp: 0  dead-savings: 0
> >>
> >>    8  [2] *_6[_10] = 1;
> >>       REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >>       basis: 6  dependent: 11  sibling: 0
> >>       next-interp: 0  dead-savings: 0
> >>
> >>   11  [2] *_13[j_8(D)] = 1;
> >>       REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>       basis: 8  dependent: 0  sibling: 0
> >>       next-interp: 0  dead-savings: 0
> >>
> >> Before the patch, the strength reduction candidate chains for the
> >> three CAND_REFs are:
> >>
> >>    _6 ->  6 ->  8
> >>    _13 ->  11
> >>
> >> i.e. SLSR recognizes that the first two references share the same
> >> basis, while the last one is on its own.
> >>
> >> With the patch, an extra candidate chain can be recognized:
> >>
> >>    a2_5(D) + (sizetype) i_1(D) * 80 ->  6 ->  11 ->  8
> >>
> >> i.e. all three references are found to have the same basis
> >> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the
> >> expanded _6 or _13 with the immediate offset removed.  The pass is
> >> now able to lower all three references to MEM_REFs, instead of only
> >> the first two.
> >
> > Ok, so slsr handles arbitrarily complex bases and figures out common
> > components?  If so, then why not just use get_inner_reference?
> 
> slsr is indeed already using get_inner_reference () to figure out the
> common components; see restructure_reference () and the comment at the
> beginning of gimple-ssa-strength-reduction.c.  I'll quote some of the
> comment here for convenience:
> 
>     Specifically, we are interested in references for which
>     get_inner_reference returns a base address, offset, and bitpos as
>     follows:
> 
>       base:    MEM_REF (T1, C1)
>       offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
>       bitpos:  C4 * BITS_PER_UNIT
> 
>     Here T1 and T2 are arbitrary trees, and C1, C2, C3, C4 are
>     arbitrary integer constants.  Note that C2 may be zero, in which
>     case the offset will be MULT_EXPR (T2, C3).
> 
>     When this pattern is recognized, the original memory reference
>     can be replaced with:
> 
>       MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
>                C1 + (C2 * C3) + C4)
> 
>     which distributes the multiply to allow constant folding.  When
>     two or more addressing expressions can be represented by MEM_REFs
>     of this form, differing only in the constants C1, C2, and C4,
>     making this substitution produces more efficient addressing during
>     the RTL phases.  When there are not at least two expressions with
>     the same values of T1, T2, and C3, there is nothing to be gained
>     by the replacement.
> 
>     Strength reduction of CAND_REFs uses the same infrastructure as
>     that used by CAND_MULTs and CAND_ADDs.  We record T1 in the base (B)
>     field, MULT_EXPR (T2, C3) in the stride (S) field, and
>     C1 + (C2 * C3) + C4 in the index (i) field.  A basis for a CAND_REF
>     is thus another CAND_REF with the same B and S values.  When at
>     least two CAND_REFs are chained together using the basis relation,
>     each of them is replaced as above, resulting in improved code
>     generation for addressing.
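> 
> To tie this to the example above: for the first two references, T1 is
> _6, T2 is j_8(D), and C3 is 4, so both record B = _6 and
> S = (sizetype) j_8(D) * 4; the index is 0 for *_6[j_8(D)] and 4 for
> *_6[_10], which is why candidate 8 has candidate 6 as its basis.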
> 
> As the last paragraph says, a basis for a CAND_REF is another CAND_REF
> with the same B and S values.  This patch extends the definition of
> basis by allowing the B's to differ, but only by an immediate constant.
> 
> >  After all, slsr does not use tree-affine as the representation for
> > bases (which it could?)
> 
> In theory it could use tree-affine for bases, and I experimented with
> that approach as well, but I encountered some unexpected re-association
> issues when building spec2k: when a tree-affine combination is
> converted back to a tree, the association order can be different from,
> or worse than, what it was before tree-affine.  I also didn't see any
> obvious benefit, so I didn't proceed further.
> 
> In the long run, the additional lowering of memory accesses you
> mentioned in http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03731.html
> may be a better solution to what I'm trying to tackle here.  I'll see
> if I can find time to work out something useful for 4.10. :)

This can be pretty dicey as well.  I originally tried to address the
reference candidate commoning by doing earlier lowering of memory
references.  The problem I ran into was a premature loss of aliasing
information, resulting in much worse code generation in a number of
places.  We couldn't think of a good way to carry the lost aliasing
information forward that wasn't a complete bloody hack.  So I backed off
from that.  Richard then made the astute observation that this was a
special case of straight line strength reduction, which GCC didn't
handle yet.  And that's how we ended up where we are...

I don't want to discourage you from looking at further lowering of
memory reference components, but be aware that you need to be thinking
carefully about the TBAA issue from the start.  Otherwise the RTL phases
can be much more constrained (particularly scheduling).

Bill

> 
> Yufeng
> 
