On Wed, Sep 4, 2024 at 12:56 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > On 9/3/24 15:07, Jan Hubicka wrote:
> >
> > > Hi,
> > > We disable gathers for zen4.  It seems that gather has improved a bit
> > > compared to zen4 and Zen5 optimization manual suggests "Avoid GATHER
> > > instructions when the indices are known ahead of time. Vector loads
> > > followed by shuffles result in a higher load bandwidth." however the
> > > situation seems to be more complicated.
> >
> > A small bit of "real world" experience (but for zen3):
> >
> > Recently I switched to gfortran 14.2 for my weather forecasting.
> > A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
> > to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
> > small effect (but positive).
> >
> > Last Monday I switched back to "-march=native -mtune=native", but that
> > consistently made a 12 hour computation around 6 minutes slower (i.e., about
> > 1/120th, or 0.8 %). The most computationally intensive part of the code needs
> > gather (either instructions or inline expansions of them).
>
> It would be nice to know what is causing this. Gathers can be enabled
> using -mtune-ctrl=use_gather and I would be happy to know about real
> world situations where they help.
>
> I am still looking into this. IMO disabling gather like on other zens
> makes sense especially for backporting. For trunk
> it probably makes sense to look for heuristics carefully enabling
> gathers. It is not clear to me how to benchmark them or how to set up
> heuristics. Spec2017 has very small coverage for loops requiring
> gathers and so does tsvc. I did some micro-benchmarks but their
> behaviour is, well, puzzling. Having additional data would be great.
>
> As Richard mentioned, it probably makes sense to enable masked gathers,
> since the open coded version needs conditionals and we would not
> vectorize at all. I am not sure if we can do that with current APIs.
> I will cook up a micro-benchmark for that.

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85919; the
targetm.vectorize.builtin_gather/targetm.vectorize.builtin_scatter
interface is legacy and does not support masking at all.

> Concerning code size, I am not sure how much that applies in practice
> since gathers are used relatively sporadically and vectorizer blows up
> the code a lot anyways, but certainly one can construct an example with
> very many loops needing gather...
>
> My guess is that array prefetching data is annotated to the instruction
> cache and since gather produces a lot of loads, probably data simply does
> not fit. Opencoding the gather makes extra space for this info...
>
> Honza
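
For anyone wanting to experiment, here is a minimal sketch of the two loop
shapes discussed above. It is only an illustration, not taken from the patch
or from any benchmark: the function names gather_copy and gather_copy_if are
made up, and whether a given GCC version vectorizes either loop (with gather
instructions or with open-coded loads) depends on the target and tuning flags.
The second loop is the interesting one: a[idx[i]] may only be read when the
condition holds, so an open-coded vector version would need conditionals,
which is where a masked gather would come in.

```c
#include <stddef.h>

/* Unconditional indexed load.  A vectorizer can use either a gather
   instruction or open-coded scalar loads combined into a vector.  */
void gather_copy(double *restrict out, const double *restrict a,
                 const int *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[idx[i]];
}

/* Conditional indexed load.  a[idx[i]] must not be read when c[i] <= 0
   (the index might be out of range there), so the load cannot simply be
   speculated; this is the masked-gather case.  */
void gather_copy_if(double *restrict out, const double *restrict a,
                    const int *restrict idx, const int *restrict c,
                    size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = c[i] > 0 ? a[idx[i]] : 0.0;
}
```

Compiling this with and without -mtune-ctrl=use_gather and comparing the
generated code (or timing it with different index patterns) might be one way
to collect the kind of data asked for above.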