> On 9/3/24 15:07, Jan Hubicka wrote:
>
> > Hi,
> > We disable gathers for zen4. It seems that gather has improved a bit
> > compared to zen4 and Zen5 optimization manual suggests "Avoid GATHER
> > instructions when the indices are known ahead of time. Vector loads
> > followed by shuffles result in a higher load bandwidth." however the
> > situation seems to be more complicated.
>
> A small bit of "real world" experience (but for zen3):
>
> Recently I switched to gfortran 14.2 for my weather forecasting.
> A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
> to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
> small effect (but positive).
>
> Last Monday I switched back to "-march=native -mtune=native", but that
> consistently made a 12 hour computation around 6 minutes slower (i.e., about
> 1/120th, or 0.8 %). The most computational intensive part of the code needs
> gather (either instructions or inline expansions of them).

It would be nice to know what is causing this. Gathers can be enabled
using -mtune-ctrl=use_gather and I would be happy to know about real world
situations where they help.

I am still looking into this. IMO disabling gather, as on the other Zens,
makes sense, especially for backporting. For trunk it probably makes sense
to look for heuristics that carefully enable gathers. It is not clear to me
how to benchmark them or how to set up the heuristics. Spec2017 has very
small coverage of loops requiring gathers, and so does tsvc. I did some
micro-benchmarks but their behaviour is, well, puzzling. Having additional
data would be great.

As Richard mentioned, it probably makes sense to enable masked gathers,
since the open-coded version needs conditionals and we would not vectorize
at all. I am not sure if we can do that with the current APIs. I will cook
up a micro-benchmark for that.

Concerning code size, I am not sure how much that applies in practice, since
gathers are used relatively sporadically and the vectorizer blows up the
code a lot anyway, but certainly one can construct an example with very many
loops needing gather...

My guess is that array prefetching data is annotated to the instruction
cache, and since a gather produces a lot of loads, the data probably simply
does not fit. Open-coding the gather makes extra space for this info...

Honza
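
For reference, here is a minimal sketch of the two loop shapes discussed above: a plain indexed load, and a conditional one where a masked gather would be needed. The function names and signatures are hypothetical, chosen only for illustration; they are not taken from the patch or from any benchmark mentioned in the thread. Whether GCC emits a gather instruction or open-codes the loads for either loop depends on the target and tuning flags.

```c
#include <stddef.h>

/* Plain gather: the address of each load depends on idx[i], so the
   vectorizer has to use either a gather instruction or an open-coded
   sequence of scalar loads and inserts.  */
void gather_copy(double *restrict out, const double *restrict a,
                 const int *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[idx[i]];
}

/* Conditional gather: the indexed load happens only when cond[i] > 0.
   Vectorizing this needs a masked gather; an open-coded version would
   need a branch per element.  */
void masked_gather_add(double *restrict out, const double *restrict a,
                       const int *restrict idx, const int *restrict cond,
                       size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cond[i] > 0)
            out[i] += a[idx[i]];
}
```

One way to compare the two code generation strategies would be to build such a file at -O3 with a Zen -march setting, once with and once without -mtune-ctrl=use_gather, and diff the generated assembly.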