Bernhard Reutner-Fischer via Fortran <fortran@gcc.gnu.org> writes:

> Well we already have
> !GCC$ ATTRIBUTES attribute-list :: var-name [, var-name] ...
>
> See https://gcc.gnu.org/onlinedocs/gfortran/ATTRIBUTES-directive.html

Yes, that's what I was hoping was simple to extend.  Sorry I didn't say
explicitly.

> For target_clones you would most likely need a slightly different parser
> for you need the user to specify the actual target_clones somehow. You
> would probably make a suggestion and discuss the proposal here.
> Ideally the syntax would be the same as in C.

Right.  I hoped it would be possible to lift machinery easily from C.
It wasn't obvious you could, but I didn't spend much time when I looked
at it a while ago.
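
For reference, a minimal sketch of the C spelling being discussed, using GCC's documented `target_clones` function attribute. The kernel `saxpy`, the macro `KERNEL_CLONES` and the chosen target list are illustrative assumptions, not anything proposed in this thread, and the guard assumes multiversioning is only available with GCC on x86_64 with glibc ifunc support.

```c
#include <assert.h>   /* on glibc this also defines __GLIBC__ */
#include <stddef.h>

/* Assumption: function multiversioning needs GCC, x86_64 and glibc ifunc
   support; elsewhere the macro expands to nothing and one version is built. */
#if defined(__x86_64__) && defined(__GNUC__) && !defined(__clang__) && defined(__GLIBC__)
#define KERNEL_CLONES __attribute__((target_clones("avx512f", "avx2", "default")))
#else
#define KERNEL_CLONES
#endif

/* Hypothetical hot-spot kernel: y := a*x + y.
   With the attribute active, GCC emits one clone per listed target and
   selects among them at load time according to the running CPU. */
KERNEL_CLONES
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

A Fortran equivalent would need the directive parser to accept the list of target strings, which is the part the existing `!GCC$ ATTRIBUTES` syntax has no slot for.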

> ---8<---
> In general, I prefer to stick to standard methods
> (which are portable) and think that those user knobs often make things
> slower rather than faster (as they tend to stay for years, even after the
> hardware has moved on - or they are even inserted blindly).
> ---8<---

There's no standard method for this sort of portable performance
engineering as far as I can tell.  The best I could see was specifying a
SIMD length statically in OpenMP.  I'm interested in things that
potentially make the difference between, say, vectorization for AVX2 or
full-width AVX512 versus SSE2 for profiled hot spots.  I fully agree
about measurement and not doing things blindly, and I prize
maintainability.  However, target_clones is clearly better than the
existing facility for explicit, target-independent unrolling, for instance.
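
To illustrate the static OpenMP option mentioned above, here is a sketch in C of a loop with a fixed `simdlen`. The function name `scale_add` and the length 8 are assumptions chosen for the example. The pragma only takes effect when OpenMP SIMD support is enabled (for GCC, `-fopenmp` or `-fopenmp-simd`); otherwise it is ignored and the loop is compiled as ordinary code.

```c
#include <stddef.h>

/* Hypothetical kernel: y := y + a*x.
   simdlen(8) states a preferred vector length at compile time. It is
   fixed in the source, so it cannot adapt to the CPU the binary later
   runs on, unlike a set of target clones chosen at load time. */
void scale_add(size_t n, double a, const double *x, double *y)
{
    #pragma omp simd simdlen(8)
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Whether the compiler can honour the requested length still depends on the instruction set selected for the whole compilation unit, which is the limitation at issue here.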

> In former times, you would compile your library multiple times
> and provide a distinct, optimized version for each of the CPUs.
> Maybe that would work for you equally well, without target_clones?

"Former times" to me means, say, GEC 4000 v. IBM 370 and the aftermath
of "all the world's a VAX", rather than different x86
micro-architectures...  I do now work on both x86_64 and POWER.

Multiple compilation isn't a good solution.  I haven't followed the
current state of hardware capability support, but relevant systems don't
have it on x86_64, at least.  That wouldn't help kernels of your
simulation code that aren't abstracted into a library or set up for
dynamic dispatch anyway.  I don't have a specific instance in mind, but
consider OS packaging, which I do; that currently has to be built for
base x86_64 (SSE2) for EPEL, at least, and so could miss a factor of
several in performance from vectorization.

> HTH

Thanks.  Definitely a more helpful response than when I asked about
doing something previously!  (I don't know if I'll actually be able to
work on it in the end, at least on work time.)
