Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Fri, 17 Sep 2021, Richard Biener wrote:
>
>> On Fri, 17 Sep 2021, Richard Sandiford wrote:
>> 
>> > Richard Biener <rguent...@suse.de> writes:
>> > > This adds the capability to analyze the dependence of mixed
>> > > pointer/array accesses.  The motivating example is one where a
>> > > masked load/store creates a pointer-based access while the
>> > > otherwise unconditional access is array-based.  Other examples
>> > > would include accesses to an array mixed with accesses from
>> > > inlined helpers that work on pointers.
>> > >
>> > > The idea is quite simple and old - analyze the data-ref indices
>> > > as if the reference was pointer-based.  The following change does
>> > > this by changing dr_analyze_indices to work on the indices
>> > > sub-structure and storing an alternate indices substructure in
>> > > each data reference.  That alternate set of indices is analyzed
>> > > lazily by initialize_data_dependence_relation when it fails to
>> > > match-up the main set of indices of two data references.
>> > > initialize_data_dependence_relation is refactored into a head
>> > > and a tail worker and changed to work on one of the indices
>> > > structures and thus away from using DR_* access macros which
>> > > continue to reference the main indices substructure.
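The lazy analysis and caching described above can be sketched as follows; the structure and function names here are hypothetical stand-ins for illustration, not GCC's actual types:

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-ins for GCC's indices / data_reference structures.
struct indices_t { std::vector<int> access_fns; };

struct data_reference_t {
  indices_t indices;                     // main (array-based) indices
  std::optional<indices_t> alt_indices;  // pointer-based view, filled lazily
};

// Placeholder for re-running dr_analyze_indices on the ref rewritten
// as a MEM_REF; the real analysis is of course different.
static indices_t
analyze_as_pointer (const data_reference_t &dr)
{
  return indices_t { dr.indices.access_fns };
}

// Compute the alternate indices on first use and cache them in the DR,
// as initialize_data_dependence_relation does when the main indices of
// two refs fail to match up.
static const indices_t &
get_alt_indices (data_reference_t &dr)
{
  if (!dr.alt_indices)
    dr.alt_indices = analyze_as_pointer (dr);
  return *dr.alt_indices;
}
```

Caching in the DR itself, rather than recomputing per pair, is what keeps the fallback at one analysis per data reference rather than one per pair.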
>> > >
>> > > This unleashes quite a few vectorization and loop distribution
>> > > opportunities in SPEC CPU 2017: 520.omnetpp_r, 548.exchange2_r,
>> > > 510.parest_r, 511.povray_r, 521.wrf_r, 526.blender_r, 527.cam4_r and
>> > > 544.nab_r show changes in what they report with -fopt-info-loop, while
>> > > the rest of the specrate set sees no changes there.  Measuring runtime
>> > > for the set where changes were reported reveals nothing off-noise
>> > > besides 511.povray_r, which seems to regress slightly for me
>> > > (on a Zen2 machine with -Ofast -march=native).
>> > >
>> > > Changes from the [RFC] version are properly handling bitfields
>> > > that we cannot take the address of, and optimizing refs that
>> > > already are MEM_REFs and thus won't see any change.  I've
>> > > also elided changing the set of vect_masked_stores targets in
>> > > favor of explicitly listing avx (but I did not verify whether the
>> > > testcase passes on aarch64-sve or amdgcn).
>> > >
>> > > This improves cases like the following from Povray:
>> > >
>> > >    for(i = 0; i < Sphere_Sweep->Num_Modeling_Spheres; i++)
>> > >      {
>> > >         VScaleEq(Sphere_Sweep->Modeling_Sphere[i].Center, Vector[X]);
>> > >         Sphere_Sweep->Modeling_Sphere[i].Radius *= Vector[X];
>> > >      }
>> > >
>> > > where a plain array access is mixed with abstraction
>> > > using T[] or T* arguments.  That situation should not be
>> > > too uncommon in the wild.  The loop above is now vectorized,
>> > > which it was not without the change.
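Reduced to its essence, the mixed pattern looks like the following sketch (hypothetical names; the real VScaleEq signature in Povray differs):

```cpp
// After inlining, the helper's accesses are pointer-based (MEM_REFs),
// while the direct member update remains an array/component access --
// the mixture the patch teaches dependence analysis to handle.
static void
scale_eq (double *v, double k)  // helper working on a pointer
{
  for (int j = 0; j < 3; j++)
    v[j] *= k;
}

struct sphere { double center[3]; double radius; };

void
scale_spheres (sphere *s, int n, double k)
{
  for (int i = 0; i < n; i++)
    {
      scale_eq (s[i].center, k);  // pointer-based after inlining
      s[i].radius *= k;           // array-based access
    }
}
```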
>> > >
>> > > Bootstrapped and tested on x86_64-unknown-linux-gnu and I've
>> > > built and run SPEC CPU 2017 successfully.
>> > >
>> > > OK?
>> > 
>> > Took a while to page this stuff back in :-/
>> > 
>> > I guess if we're adding alt_indices to the main data_reference,
>> > we'll need to free the access_fns in free_data_ref.  It looked like
>> > the patch as posted might have a memory leak.
>> 
>> Doh, yes - thanks for noticing.
>> 
>> > Perhaps instead we could use local indices structures for the
>> > alt_indices and pass them in to initialize_data_dependence_relation?
>> > Not that that's very elegant…
>> 
>> Yeah, I had that, but since for N data references we possibly
>> call initialize_data_dependence_relation N^2 times, we'd do this
>> alternate analysis N^2 times at worst instead of N, so it looked worth
>> caching it in the data reference.  Since we have no idea why the
>> first try fails, we're going to try this fallback in the majority
>> of cases that we cannot figure out otherwise, so I didn't manage
>> to hand-wave the quadraticness away ;)  OTOH one might argue
>> it's just a constant factor on top of the number of
>> initialize_data_dependence_relation invocations.
>> 
>> So I can probably be convinced either way - guess I'll gather
>> some statistics.
>
> I built SPEC 2017 CPU rate with -Ofast -march=znver2, overall there
> are
>
>  4433976     calls to the first stage initialize_data_dependence_relation
>              (skipping the cases dr_may_alias returned false)
>  360248 (8%) ended up recursing with a set of alt_indices
>  83512       times we computed alt_indices of a DR (that's with the cache)
>  14905 (0.3%) times the recursive invocation ended with != chrec_dont_know
>
> thus when not doing the caching we'd compute alt_indices about 10 times
> more often.  I didn't collect the number of distinct DRs (that's difficult
> at this point), but I'd estimate from the above that we have 3 times
> more "unused" alt_indices than used.
>
> OK, so that didn't really help me avoid flipping a coin ;)

Sounds like a good justification for keeping the caching to me :-)

Richard
