On Fri, May 7, 2021 at 6:29 AM Feng Xue OS <[email protected]> wrote:
>
> >> gcc/
> >> PR tree-optimization/98598
> >> * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> >> * common.opt (-ftree-loop-mgo): New option.
> >
> > Just a quick comment - -ftree-loop-mgo is user-facing and it isn't
> > really a good name.  -floop-mgo would be better but still I'd have
> > no idea what this would do.
> >
> > I don't have a good suggestion here other than to expand it to
> > -floop-gather-memory (?!).
>
> OK. That is better than "mgo", which is only an abbreviation we used
> during development.
>
> > The option documentation isn't informative either.
> >
> > From:
> >
> > outer-loop ()
> >   {
> >     inner-loop (iter, iter_count)
> >       {
> >         Type1 v1 = LOAD (iter);
> >         Type2 v2 = LOAD (v1);
> >         Type3 v3 = LOAD (v2);
> >         ...
> >         iter = NEXT (iter);
> >       }
> >   }
> >
> > To:
> >
> > typedef struct cache_elem
> >   {
> >     bool init;
> >     Type1 c_v1;
> >     Type2 c_v2;
> >     Type3 c_v3;
> >   } cache_elem;
> >
> > cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
> >
> > outer-loop ()
> >   {
> >     size_t cache_idx = 0;
> >
> >     inner-loop (iter, iter_count)
> >       {
> >         if (!cache_arr[cache_idx].init)
> >           {
> >             v1 = LOAD (iter);
> >             v2 = LOAD (v1);
> >             v3 = LOAD (v2);
> >
> >             cache_arr[cache_idx].init = true;
> >             cache_arr[cache_idx].c_v1 = v1;
> >             cache_arr[cache_idx].c_v2 = v2;
> >             cache_arr[cache_idx].c_v3 = v3;
> >           }
> >         else
> >           {
> >             v1 = cache_arr[cache_idx].c_v1;
> >             v2 = cache_arr[cache_idx].c_v2;
> >             v3 = cache_arr[cache_idx].c_v3;
> >           }
> >         ...
> >         cache_idx++;
> >         iter = NEXT (iter);
> >       }
> >   }
> >
> > This is a _very_ special transform. What it seems to do is
> > optimize the dependent loads for outer loop iteration n > 1
> > by caching the result(s). If that's possible then you should
> > be able to distribute the outer loop to one doing the caching
> > and one using the cache. Then this transform would be more
> > like a traditional array expansion of scalars?  In some cases
> > also loop interchange could remove the need for the caching.
> >
> > Doing MGO as the very first loop pass thus looks bad, I think
> > MGO should be much later, for example after interchange.
> > I also think that MGO should work in concert with loop
> > distribution (which could have an improved cost model)
> > rather than being a separate pass.
> >
> > Your analysis phase looks quite expensive, building something
> > like an on-the-side representation very closely matching SSA.
> > It seems to work from PHI defs to uses, which looks backwards.
>
> I did not quite catch this point.  Would you please explain it in
> more detail?
I don't remember exactly, but you are building a lot of data structures
that resemble ones readily available when you do find_dep_loads.
You're looking at each and every stmt, searching for operands that
match something, and then you're walking SSA uses again.  And the
searching uses linear vector walks (vec::contains).
> > You seem to roll your own dependence analysis code :/ Please
> > have a look at loop distribution.
> >
> > Also you build an actual structure type for reasons that escape
> > me rather than simply accessing the allocated storage at
> > appropriate offsets.
> >
> > I think simply calling 'calloc' isn't OK because you might need
> > aligned storage and because calloc might not be available.
> > Please at least use 'malloc' and make sure MALLOC_ABI_ALIGNMENT
> > is large enough for the data you want to place (or perform
> > dynamic re-alignment yourself). We probably want some generic
> > middle-end utility to obtain aligned allocated storage at some
> > point.
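To illustrate the dynamic re-alignment mentioned above, here is a hand-written sketch (not existing GCC code; the helper names `aligned_alloc_fallback` and `aligned_free_fallback` are invented for this example).  It over-allocates with plain malloc, rounds the pointer up to the requested power-of-two alignment, and stashes the original pointer just before the aligned region so it can be freed later:

```c
#include <stdint.h>
#include <stdlib.h>

/* Return a pointer aligned to ALIGN bytes (ALIGN must be a power of two)
   inside a malloc'd block with room for SIZE bytes.  The raw malloc
   pointer is stored immediately before the aligned region.  */
static void *
aligned_alloc_fallback (size_t align, size_t size)
{
  void *raw = malloc (size + align + sizeof (void *));
  if (!raw)
    return NULL;
  uintptr_t base = (uintptr_t) raw + sizeof (void *);
  uintptr_t aligned = (base + align - 1) & ~(uintptr_t) (align - 1);
  ((void **) aligned)[-1] = raw;   /* remember the raw pointer for free */
  return (void *) aligned;
}

static void
aligned_free_fallback (void *p)
{
  if (p)
    free (((void **) p)[-1]);
}
```

This is the usual user-space trick; an emitted-code version would of course do the same arithmetic in GIMPLE rather than call helpers.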
> >
> > As said above I think you want to re-do this transform as
> > a loop distribution transform. I think if caching works then
> > the loads should be distributable and the loop distribution
> > transform should be enhanced to expand the scalars to arrays.
>
> I checked the code of loop distribution, and its trigger strategy
> seems to be very conservative: it currently targets only simple and
> regular index-based loops, and cannot handle linked-list traversal,
> which consists of a series of discrete memory accesses and is exactly
> where MGO would matter a lot.  Additionally, for some complicated
> cases, we could not completely decompose MGO into two separate loops
> for "do caching" and "use caching" respectively.  An example:
>
> for (i = 0; i < N; i++)
>   {
>     for (j = 0; j < i; j++)
>       {
>         Type1 v1 = LOAD_FN1 (j);
>         Type2 v2 = LOAD_FN2 (v1);
>         Type3 v3 = LOAD_FN3 (v2);
>
>         ...
>
>         condition = ...
>       }
>
>     if (condition)
>       break;
>   }
>
> We should not cache all of the loads (N in total) in one step, since
> some of them might be invalid once "condition" breaks out of the
> loop.  We have to mix "do caching" and "use caching", and switch
> between them dynamically based on the "init" flag.
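A compact, runnable C rendition of that lazy scheme (all names are invented for illustration, and a plain array stands in for the dependent-load chain): the "init" flag decides per slot whether an iteration fills the cache entry or reuses it, so only elements the loop has actually visited ever get cached:

```c
#include <stdbool.h>
#include <stdlib.h>

struct elem
{
  bool init;   /* has this slot been filled yet? */
  int c_v1;    /* cached dependent-load result */
};

/* Lazy caching: each inner iteration either performs the load and
   records it, or reuses the previously cached value.  */
int
run (const int *data, size_t n, int outer_iters)
{
  struct elem *cache = calloc (n, sizeof (struct elem));
  if (!cache)
    return -1;

  int sum = 0;
  for (int k = 0; k < outer_iters; k++)
    for (size_t j = 0; j < n; j++)
      {
        int v1;
        if (!cache[j].init)
          {
            v1 = data[j];          /* the (expensive) dependent load */
            cache[j].init = true;
            cache[j].c_v1 = v1;
          }
        else
          v1 = cache[j].c_v1;      /* hit: reuse the cached value */
        sum += v1;
      }

  free (cache);
  return sum;
}
```

After the first outer iteration every slot is warm, so later iterations never re-execute the loads, which is the whole point of the transform.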
Maybe, but then classical array expansion is useful and runs into
the same dynamic allocation problem.  I suppose loop distribution
could be taught the dependent-loads case - it's not unlike
gathers.
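For the fully distributable case, here is a hand-written sketch (invented names, not GCC output) of what the distributed form could look like: one loop gathers the dependent-load results into an array, and the outer loop then reads only the cache:

```c
#include <stdlib.h>

/* Hypothetical node type giving a linked-list traversal with a
   dependent-load chain: iter -> it->p1 -> *it->p1.  */
struct node
{
  struct node *next;
  int *p1;
};

/* Distributed form of the caching transform, sketched by hand:
   loop 1 performs the dependent loads once ("do caching"),
   loop 2 (the original outer loop) only reads the cache.  */
int
sum_all (struct node *head, size_t n, int outer_iters)
{
  int *cache = malloc (n * sizeof (int));
  if (!cache)
    return -1;

  /* Loop 1: gather the dependent-load results.  */
  size_t count = 0;
  struct node *it;
  for (it = head; it && count < n; it = it->next, count++)
    cache[count] = *it->p1;

  /* Loop 2: the outer loop now touches only the cache array.  */
  int sum = 0;
  for (int k = 0; k < outer_iters; k++)
    for (size_t j = 0; j < count; j++)
      sum += cache[j];

  free (cache);
  return sum;
}
```

This only works when the gathering loop may run to completion unconditionally; the early-break example above is exactly the case where this split is not valid and the lazy "init" scheme is needed instead.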
> But loop distribution does have some overlap with MGO in analysis
> and transformation; we will try to see if there is a way to unify
> them.
Thanks.
Richard.
> Thanks,
> Feng