On Wed, Sep 4, 2024 at 12:56 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > On 9/3/24 15:07, Jan Hubicka wrote:
> >
> > > Hi,
> > > We disable gathers for zen4.  It seems that gather has improved a bit
> > > compared to zen4 and the Zen5 optimization manual suggests "Avoid GATHER
> > > instructions when the indices are known ahead of time. Vector loads
> > > followed by shuffles result in a higher load bandwidth." however the
> > > situation seems to be more complicated.
> >
> > A small bit of "real world" experience (but for zen3):
> >
> > Recently I switched to gfortran 14.2 for my weather forecasting.
> > A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
> > to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
> > small effect (but positive).
> >
> > Last Monday I switched back to "-march=native -mtune=native", but that
> > consistently made a 12 hour computation around 6 minutes slower (i.e., about
> > 1/120th, or 0.8 %). The most computationally intensive part of the code needs
> > gather (either instructions or inline expansions of them).
>
> It would be nice to know what is causing this. Gathers can be enabled
> using -mtune-ctrl=use_gather and I would be happy to know about real
> world situations where they help.
>
> I am still looking into this.  IMO disabling gather like on other zens
> makes sense especially for backporting. For trunk
> it probably makes sense to look for heuristics carefully enabling
> gathers.  It is not clear to me how to benchmark them or how to set up
> heuristics.  Spec2017 has very small coverage for loops requiring
> gathers and so does tsvc. I did some micro-benchmarks but their
> behaviour is, well, puzzling. Having additional data would be great.
>
> As Richard mentioned, it probably makes sense to enable masked gathers,
> since the open coded version needs conditionals and we would not
> vectorize at all.  I am not sure if we can do that with current APIs.
> I will cook up a micro-benchmark for that.

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85919; the
targetm.vectorize.builtin_gather/targetm.vectorize.builtin_scatter interface
is legacy and does not support masking at all.
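
For concreteness, here is a minimal sketch of the two loop shapes being
discussed. The function and parameter names are made up for illustration and
are not taken from GCC or from any benchmark. The first loop is a plain
indexed load, which can be done either with a gather or with open-coded
scalar loads; the second only loads when a condition holds, so the load
cannot be speculated and vectorizing it would need a masked gather. Whether
a given GCC version vectorizes either loop depends on the tuning in effect
(for example with and without -mtune-ctrl=use_gather), so treat this as a
starting point for experiments rather than a statement about what the
compiler does.

```c
#include <stddef.h>

/* Unconditional indexed load: every element a[idx[i]] is read, so the
   vectorizer may use a gather or open-code it as scalar loads.  */
void gather_copy(double *restrict out, const double *restrict a,
                 const int *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[idx[i]];
}

/* Conditional indexed load: a[idx[i]] is only read when cond[i] is
   nonzero.  idx[i] may be out of range when cond[i] is zero, so the load
   must not be executed speculatively; this is the masked gather case.  */
void masked_gather_copy(double *restrict out, const double *restrict a,
                        const int *restrict idx, const int *restrict cond,
                        size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = cond[i] ? a[idx[i]] : 0.0;
}
```

Comparing the assembly for both functions with and without
-mtune-ctrl=use_gather should show whether the second loop is vectorized
at all when gathers are disabled.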

> Concerning code size, I am not sure how much that applies in practice
> since gathers are used relatively sporadically and the vectorizer blows up
> the code a lot anyway, but certainly one can construct an example with
> very many loops needing gather...
>
> My guess is that array prefetching data is annotated to the instruction
> cache and since gather produces a lot of loads, the data probably simply
> does not fit. Opencoding the gather makes extra space for this info...
>
> Honza
>
