Bernhard Reutner-Fischer via Fortran <fortran@gcc.gnu.org> writes:

> Well we already have
> !GCC$ ATTRIBUTES attribute-list :: var-name [, var-name] ...
>
> See https://gcc.gnu.org/onlinedocs/gfortran/ATTRIBUTES-directive.html

Yes, that's what I was hoping was simple to extend. Sorry I didn't say
explicitly.

> For target_clones you would most likely need a slightly different parser
> for you need the user to specify the actual target_clones somehow. You
> would probably make a suggestion and discuss the proposal here.
> Ideally the syntax would be the same as in C.

Right. I hoped it would be possible to lift machinery easily from C. It
wasn't obvious you could, but I didn't spend much time when I looked at
it a while ago.

> ---8<---
> In general, I prefer to stick to standard methods
> (which are portable) and think that those user knobs often make things
> slower than faster (as they tend to stay for years, even after the
> hardware as moved on - or they are even inserted blindly).
> ---8<---

There's no standard method for this sort of portable performance
engineering as far as I can tell. The best I could see was specifying a
SIMD length statically in OpenMP. I'm interested in things that
potentially make the difference between, say, vectorization for AVX2 or
full-width AVX512 versus SSE2 for profiled hot spots.

I fully agree about measurement and not doing things blindly, and I
prize maintainability. However, target_clones is clearly better than the
existing facility for explicit, target-independent unrolling, for
instance.

> In former times, you would compile your library multiple times
> and provide a distinct, optimized version for each of the CPUs.
> Maybe that would work for you equally well, without target_clones?

"Former times" to me means, say, GEC 4000 v. IBM 370 and the aftermath
of "all the world's a VAX", rather than different x86
micro-architectures... I do now work on both x86_64 and POWER.

Multiple compilation isn't a good solution. I haven't followed the
current state of hardware capability support, but relevant systems don't
have it on x86_64, at least.
That wouldn't help kernels of your simulation code that aren't
abstracted into a library or set up for dynamic dispatch anyway. I don't
have a specific instance in mind, but consider OS packaging, which I do;
that currently has to be built for base x86_64 (SSE2) for EPEL, at
least, and so could miss a factor of several in performance from
vectorization.

> HTH

Thanks. Definitely a more helpful response than when I asked about doing
something previously! (I don't know if I'll actually be able to work on
it in the end, at least on work time.)
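
For concreteness, this is roughly what the C form under discussion looks like with GCC's target_clones function attribute. It is only a sketch under assumptions: the saxpy kernel and the CLONES macro are hypothetical names made up for illustration, and the target list (avx512f, avx2, default) is just one plausible choice for the AVX512/AVX2/SSE2 case mentioned above. The preprocessor guard is there because the attribute depends on ifunc support, so it is enabled only on x86_64 GNU/Linux and compiles to a plain function elsewhere.

```c
#include <stddef.h>

/* Enable function multi-versioning only where it is expected to work:
   a GNU-compatible compiler on x86_64 GNU/Linux (needs ifunc support).
   CLONES is a hypothetical convenience macro, not part of GCC. */
#if defined(__x86_64__) && defined(__gnu_linux__) && defined(__GNUC__)
#define CLONES __attribute__((target_clones("avx512f", "avx2", "default")))
#else
#define CLONES
#endif

/* Hypothetical hot-spot kernel: y = a*x + y.
   With CLONES active, the compiler emits one version per listed target
   plus a resolver that picks one at load time based on the host CPU. */
CLONES
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Callers just call saxpy(n, a, x, y) as usual; the dispatch is invisible at the call site, which is what makes a single base-x86_64 build usable across micro-architectures. A Fortran equivalent would presumably need the !GCC$ ATTRIBUTES parser to accept the target list as an argument, which is the open question above.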