> On 9/3/24 15:07, Jan Hubicka wrote:
> 
> > Hi,
> > We disable gathers for zen4.  It seems that gather has improved a bit
> > compared to zen4 and Zen5 optimization manual suggests "Avoid GATHER
> > instructions when the indices are known ahead of time. Vector loads
> > followed by shuffles result in a higher load bandwidth." however the
> > situation seems to be more complicated.
> 
> A small bit of "real world" experience (but for zen3):
> 
> Recently I switched to gfortran 14.2 for my weather forecasting.
> A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
> to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
> small effect (but positive).
> 
> Last Monday I switched back to "-march=native -mtune=native", but that
> consistently made a 12 hour computation around 6 minutes slower (i.e., about
> 1/120th, or 0.8 %). The most computational intensive part of the code needs
> gather (either instructions or inline expansions of them).

It would be nice to know what is causing this. Gathers can be enabled
using -mtune-ctrl=use_gather and I would be happy to know about real
world situations where they help.
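
For anyone who wants to try this on their own code, here is a minimal
sketch of the kind of loop in question. The function name gather_scale
and the file name gather.c are made up for illustration, and whether the
vectorizer actually picks a gather instruction depends on the GCC version
and the cost model, so the assembly needs to be checked rather than assumed.

```c
#include <stddef.h>

/* Indexed load: src[idx[i]] can be vectorized either with a hardware
   gather instruction or with scalar loads that are inserted into a
   vector register (the "open coded" variant).

   To compare the two on a Zen machine (assumed command lines):
     gcc -O3 -march=native -S gather.c -o default.s
     gcc -O3 -march=native -mtune-ctrl=use_gather -S gather.c -o gather.s
   and then look for vgather* instructions in the two outputs.  */
void gather_scale(double *restrict dst, const double *restrict src,
                  const int *restrict idx, size_t n, double scale)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = scale * src[idx[i]];
}
```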

I am still looking into this.  IMO disabling gather, as on other Zens,
makes sense, especially for backporting. For trunk
it probably makes sense to look for heuristics carefully enabling
gathers.  It is not clear to me how to benchmark them or how to set up
heuristics.  SPEC2017 has very small coverage for loops requiring
gathers and so does TSVC. I did some micro-benchmarks but their
behaviour is, well, puzzling. Having additional data would be great.
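
To make "micro-benchmark" concrete, the following is only a sketch of one
possible shape, not the benchmarks mentioned above. All names (kernel,
fill_indices, time_indexed_loads) and the stride-based index pattern are
assumptions chosen for illustration. The empty asm statement is a GNU
extension used as a compiler barrier.

```c
#include <stdlib.h>
#include <time.h>

/* Kernel under test: one indexed load per element. */
static void kernel(double *restrict dst, const double *restrict src,
                   const int *restrict idx, size_t n, double scale)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = scale * src[idx[i]];
}

/* stride 1 gives (wrapping) sequential indices; a larger stride spreads
   the accesses across the table. */
static void fill_indices(int *idx, size_t n, size_t table_size, size_t stride)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = (int)((i * stride) % table_size);
}

/* Returns CPU seconds spent in `reps` passes of the kernel,
   or -1.0 if allocation fails. */
double time_indexed_loads(size_t n, size_t table_size, size_t stride, int reps)
{
    double *src = malloc(table_size * sizeof *src);
    double *dst = malloc(n * sizeof *dst);
    int *idx = malloc(n * sizeof *idx);
    double elapsed = -1.0;

    if (src && dst && idx) {
        for (size_t i = 0; i < table_size; i++)
            src[i] = (double)i;
        fill_indices(idx, n, table_size, stride);

        clock_t start = clock();
        for (int r = 0; r < reps; r++) {
            kernel(dst, src, idx, n, 1.0 + r);
            /* Barrier: keep the stores to dst from being optimized away. */
            __asm__ volatile("" : : "r"(dst) : "memory");
        }
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    free(src);
    free(dst);
    free(idx);
    return elapsed;
}
```

Building this once with default tuning and once with
-mtune-ctrl=use_gather, and varying table_size so the table does or does
not fit in cache, gives a rough comparison. It says nothing about how
gathers behave inside a larger loop body.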

As Richard mentioned, it probably makes sense to enable masked gathers,
since the open coded version needs conditionals and we would not
vectorize at all.  I am not sure if we can do that with current APIs.
I will cook up a micro-benchmark for that.
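
A sketch of the kind of loop meant by "masked gather" (the function name
cond_gather_add and the use of a negative index as the "skip" marker are
assumptions for illustration only):

```c
#include <stddef.h>

/* The load from src[idx[i]] may only happen when the condition holds,
   since a negative index would read out of bounds.  A vectorized
   version therefore needs a masked gather; open coding it would need
   a branch per lane. */
void cond_gather_add(double *restrict dst, const double *restrict src,
                     const int *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (idx[i] >= 0)
            dst[i] += src[idx[i]];
}
```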

Concerning code size, I am not sure how much that applies in practice
since gathers are used relatively sporadically and the vectorizer blows up
the code a lot anyway, but certainly one can construct an example with
very many loops needing gather...

My guess is that array prefetching data is annotated to the instruction
cache and since gather produces a lot of loads, the data probably simply
does not fit. Open-coding the gather makes extra space for this info...

Honza
