Function multiversioning ABI issues

2023-10-10 Thread Andrew Carlotti via Gcc
Hi,

I've been looking into existing function multiversioning implementations (while
working to add support for fmv in GCC on aarch64).  It seems there are various
inconsistencies among current implementations, and it's unclear to me which (if
any) of these differences could be problematic.  There's a list of observations
of current behaviour at the end of the email.

I've also seen the GCC documentation for the ifunc attribute [1].  This states
that "the indirect function needs to be defined in the same translation unit as
the resolver function".  This is not how function multiversioning is currently
implemented.  Instead, the resolver functions are added to the translation
units of every caller.  What is the reason for the restriction specified in the
documentation?  Does this mean that current function multiversioning
implementations are subtly broken when used across different translation units?
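
For readers unfamiliar with the mechanism, here is a minimal sketch of what a
resolver does, emulated in portable C with a plain function pointer rather than
the ifunc attribute itself.  All names (sum, resolve_sum, cpu_has_fast_path)
are hypothetical, and the selection happens lazily on the first call, whereas a
real ifunc symbol is bound by the dynamic loader.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*sum_fn)(const int *, size_t);

/* Baseline version: always safe to call. */
static int sum_default(const int *v, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Stand-in for a version tuned for some CPU feature (hypothetical). */
static int sum_unrolled(const int *v, size_t n)
{
    int total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        total += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; i++)
        total += v[i];
    return total;
}

/* Placeholder feature probe (assumption): always reports "not available". */
static int cpu_has_fast_path(void)
{
    return 0;
}

/* The "resolver": picks one implementation and returns its address. */
static sum_fn resolve_sum(void)
{
    return cpu_has_fast_path() ? sum_unrolled : sum_default;
}

/* The symbol callers use.  It resolves once, then forwards every call. */
int sum(const int *v, size_t n)
{
    static sum_fn impl;          /* NULL until the first call */
    if (impl == NULL)
        impl = resolve_sum();
    assert(impl != NULL);
    return impl(v, n);
}
```

In this sketch the resolver, the dispatching symbol and both versions all live
in one translation unit, which is the arrangement the ifunc documentation
describes.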

On a related point - how important is the mangling of the symbols involved (for
the function implementations, the resolver, and the symbol that will be
resolved at load time)?  We're not consistent about how we mangle these on any
target - even the explicit mangling requirements in the aarch64 Beta
specification [2] apply only to the function implementations, and make no
mention of how the other symbols should be mangled.

It seems to me that things would have been much simpler if the resolver were
always in the same translation unit as the implementations, and the symbol that
is resolved at run time used the original function symbol name.  This would
allow the mangling of the function versions and the resolver function to be a
hidden implementation detail, with no need to specify it as part of an explicit
or implicit ABI.  It would add an explicit restriction that all function
implementations must be in the same translation unit, but it seems that might
be required for absolute correctness anyway.

So, how much of this stuff matters?  Is some of this stuff actually broken?
Will we have horrible compatibility issues if we try to fix some of it?

Thanks,
Andrew

(cc'd to Daniel Kiss and Pavel Iliin, who have been working on the aarch64
function multiversioning specification and LLVM implementation respectively.)

---
Current behaviour:


For all combinations of the following, I checked to see whether the compiler
would emit a resolver function:
a) GCC / Clang
b) aarch64 / rs6000 / i386
c) target_clones attribute / target (or target_version) attribute
d) function is fully implemented / function is just declared
e) function is referenced / function is unreferenced

Function multiversioning is not supported for rs6000 (power) in Clang.  For
aarch64, I used my WIP implementation in GCC.

If the function was referenced, then the compiler always emitted a resolver.
However, when the function is declared but not implemented, then GCC for rs6000
would emit a resolver that always used the default version.

If the function was not referenced, then the compiler usually did not emit a
resolver.  The exceptions to this were both in GCC, with resolvers being
emitted both for rs6000 when using an implemented function with target_clones,
and for aarch64 (with my WIP patch) when using an implemented function with
target_version.


There are also many discrepancies in the mangling used in the various symbol
names:

For the symbol that will be resolved at load time, Clang always uses the
original name and appends ".ifunc".  GCC leaves the symbol name unchanged when
using target_clones, but adds a second layer of C++ mangling when using target
attribute versions.

For the resolver function, the symbol name is always generated by appending
".resolver" to the original mangled function name.

For the individual versions, the following changes are made:
- Clang for i386 numbers the versions when using target_clones (by appending
  ".0", ".1", etc.).  GCC used to share this behaviour, but the numbering was
  removed for GCC 12 (commit bfc9250e).
- GCC for i386, when using target_clones, will substitute any non-alphanumeric
  character with "_" when forming the suffix.  E.g., an "sse4.2" version of foo
  has a symbol name of "_Z3foov.sse4.2" using target attributes, but
  "_Z3foov.sse4_2" using target_clones.
- GCC for rs6000 has no target-defined mangling for versions using the "target"
  attribute, so the compiler output will fail to assemble due to duplicate
  symbol names.  However, there seem to be no errors or warnings to detect this
  at an earlier stage.
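
The difference between the two i386 suffix schemes can be sketched as follows.
This only reproduces the symbol names observed above and is not GCC's actual
mangling code; version_symbol and its sanitise flag are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Build "<mangled>.<version>" in out.  When sanitise is non-zero, every
   non-alphanumeric character of the version suffix is replaced with '_',
   mirroring the target_clones behaviour described above.
   Returns 0 on success, or -1 if the buffer is too small. */
int version_symbol(char *out, size_t out_size, const char *mangled,
                   const char *version, int sanitise)
{
    assert(out != NULL && mangled != NULL && version != NULL);

    int len = snprintf(out, out_size, "%s.%s", mangled, version);
    if (len < 0 || (size_t)len >= out_size)
        return -1;

    if (sanitise) {
        /* Skip the base name and the '.' separator, then rewrite the suffix. */
        for (char *p = out + strlen(mangled) + 1; *p != '\0'; p++) {
            if (!isalnum((unsigned char)*p))
                *p = '_';
        }
    }
    return 0;
}
```

With sanitise set to 0 the "sse4.2" version of "_Z3foov" keeps its dot, and
with sanitise set to 1 it becomes "_Z3foov.sse4_2", so the same source function
ends up with two different version symbol names depending on which attribute
was used.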

(Incidentally, is there any specific documentation for rs6000 function
multiversioning? I was surprised to discover during testing that the resolver
was ignoring some of my function clones, until I found the list of supported
multiversioning targets in rs6000_clone_map.)

---

[1] 
https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc/Common-Function-Attributes.html#index-ifunc-function-attribute

[2] 
https://github.com/ARM-software/acle/blob/main/main/acle.md#function-multi-versioning


Re: Function multiversioning ABI issues

2023-10-11 Thread Andrew Carlotti via Gcc
On Wed, Oct 11, 2023 at 10:59:10AM +0200, Florian Weimer wrote:
> * Andrew Carlotti via Gcc:
> 
> > I've also seen the GCC documentation for the ifunc attribute [1].
> > This states that "the indirect function needs to be defined in the
> > same translation unit as the resolver function".  This is not how
> > function multiversioning is currently implemented.  Instead, the
> > resolver functions are added to the translation units of every caller.
> 
> I don't see how this can happen.  Do you have a declaration of the
> resolver function in a shared header, by chance?
> 
> Thanks,
> Florian

I haven't explicitly declared a separate function. I just included the normal
function multiversioning attributes on the function declarations.

If you don't include the attributes on the declarations, then you will get one
of two issues:

- For target_clones, you would need to ensure that the multiversioned function
  has a caller in the same translation unit as the implementations. If you
  don't do this, then no resolver will be generated, with all of the callers
  referencing a non-existent symbol.

- For target versions, any caller that cannot see the function multiversioning
  attributes will only ever use the default version of the function. This also
  applies to the current aarch64 specification for target_clones and
  target_version.
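
A minimal sketch of the arrangement I mean, with the attribute repeated on the
shared declaration so that every caller can see it.  It assumes x86-64
GNU/Linux with glibc ifunc support; the FMV_CLONES macro expands to nothing
elsewhere, so the code degrades to a plain function.  The names dot and
FMV_CLONES are hypothetical, and the header and implementation are shown in a
single file for brevity.

```c
#include <assert.h>   /* any libc header also defines __GLIBC__ on glibc */
#include <stddef.h>

/* Assumption: only enable multiversioning where it is known to work. */
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__) \
    && defined(__GLIBC__)
#define FMV_CLONES __attribute__((target_clones("default", "avx2")))
#else
#define FMV_CLONES
#endif

/* What would live in a shared header: the declaration carries the
   multiversioning attribute, so callers know the function is versioned. */
FMV_CLONES long dot(const int *a, const int *b, size_t n);

/* What would live in the implementation file: the same attribute on the
   definition, from which the compiler generates each version. */
FMV_CLONES long dot(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += (long)a[i] * b[i];
    return total;
}
```

Dropping the attribute from the declaration while keeping it on the definition
gives the target_clones situation described in the first bullet above.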


Re: Help needed with maintainer-mode

2024-03-06 Thread Andrew Carlotti via Gcc
On Thu, Feb 29, 2024 at 06:39:54PM +0100, Christophe Lyon via Gcc wrote:
> On Thu, 29 Feb 2024 at 12:00, Mark Wielaard wrote:
> >
> > Hi Christophe,
> >
> > On Thu, Feb 29, 2024 at 11:22:33AM +0100, Christophe Lyon via Gcc wrote:
> > > I've noticed that sourceware's buildbot has a small script
> > > "autoregen.py" which does not use the project's build system, but
> > > rather calls aclocal/autoheader/automake/autoconf in an ad-hoc way.
> > > Should we replicate that?
> >
> > That python script works across gcc/binutils/gdb:
> > https://sourceware.org/cgit/builder/tree/builder/containers/autoregen.py
> >
> > It is installed into a container file that has the exact autoconf and
> > automake version needed to regenerate the autotool files:
> > https://sourceware.org/cgit/builder/tree/builder/containers/Containerfile-autotools
> >
> > And it was indeed done this way because that way the files are
> > > regenerated in a reproducible way. Which wasn't the case when using
> > > --enable-maintainer-mode (and autoreconf also doesn't work).
> 
> I see. So it is possibly incomplete, in the sense that it may lack
> some of the steps that maintainer-mode would perform?
> For instance, gas for aarch64 has some *opcodes*.c files that need
> regenerating before committing. The regeneration step is enabled in
> maintainer-mode, so I guess the autoregen bots on Sourceware would
> miss a problem with these files?
> 
> Thanks,
> 
> Christophe

Speaking of opcodes/aarch64-{asm|dis|opc}-2.c - why are these in the source
directory in the first place?  For a similar situation in GCC (gimple-match,
generic-match, insn-emit, etc.) we write the generated files to the build
directory, and generation is always enabled.  I see no reason not to do the
same thing for aarch64-{asm|dis|opc}-2.c.

Andrew

> >
> > It is run on all commits and warns if it detects a change in the
> > (checked in) generated files.
> > https://builder.sourceware.org/buildbot/#/builders/gcc-autoregen
> > https://builder.sourceware.org/buildbot/#/builders/binutils-gdb-autoregen
> >
> > Cheers,
> >
> > Mark


Proposed new pass to optimise mode register assignments

2024-09-06 Thread Andrew Carlotti via Gcc
Hi,

I'm working on optimising assignments to the AArch64 Floating-point Mode
Register (FPMR), as part of our FP8 enablement work.  Claudio has already
implemented FPMR as a hard register, with the intention that FP8 intrinsic
functions will compile to a combination of an fpmr register set, followed by an
FP8 operation that takes fpmr as an input operand.

It would clearly be inefficient to retain an explicit FPMR assignment prior to
each FP8 instruction (especially in the common case where every assignment uses
the same FPMR value).  I think the best way to optimise this would be to
implement a new pass that can optimise assignments to individual hard registers.
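
To make the cost concrete, here is a minimal model of the straight-line case.
It is not GCC code: struct fp8_op and both functions are hypothetical, and
value_id is an assumed stand-in for "FPMR values the compiler has proven
equal", which may be runtime values rather than constants.

```c
#include <assert.h>
#include <stddef.h>

/* One FP8 operation in a straight-line sequence, tagged with an identifier
   for the FPMR value it needs.  Equal ids mean provably equal values. */
struct fp8_op {
    int value_id;
};

/* Unoptimised lowering: one FPMR write before every FP8 operation. */
size_t naive_fpmr_writes(size_t n_ops)
{
    return n_ops;
}

/* Emit a write only when the value currently held in FPMR is unknown or
   differs from the one the next operation needs. */
size_t tracked_fpmr_writes(const struct fp8_op *ops, size_t n_ops)
{
    assert(ops != NULL || n_ops == 0);

    size_t writes = 0;
    int known = 0;      /* nothing is known about FPMR on entry */
    int current = 0;

    for (size_t i = 0; i < n_ops; i++) {
        if (!known || current != ops[i].value_id) {
            current = ops[i].value_id;
            known = 1;
            writes++;
        }
    }
    return writes;
}
```

For a sequence needing values 1, 1, 1, 2, 2, 1 the naive lowering performs six
writes and the tracked one performs three.  This covers only a single
straight-line sequence.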

There are a number of existing passes that do similar optimisations, but which
I believe are unsuitable for this scenario for various reasons.  For example:

- cse1 can already optimise FPMR assignments within an extended basic block,
  but can't handle broader optimisations.
- pre (in gcse.c) doesn't work with assigning constant values, which would miss
  many potential usages.  It also has limits on how far code can be moved,
  based around ideas of register pressure that don't apply to the context of a
  single hard register that shouldn't be used by the register allocator for
  anything else.  Additionally, it doesn't run at -Os.
- hoist (also using gcse.c) only handles constant values, and only runs when
  optimising for size.  It also has the rest of the issues that pre does.
- mode_sw only handles a small finite set of modes.  The mode requirements are
  determined solely by the instructions that require the specific mode, so mode
  switches don't depend on the output of previous instructions.


My intention would be for the new pass to reuse ideas, and hopefully some of
the existing code, from the mode-switching and gcse passes.  In particular,
gcse.c (or its dependencies) has code that could identify when values assigned
to the FPMR are known to be the same (although we may not need the full CSE
capabilities of gcse.c), and mode-switching.cc knows how to globally optimise
mode assignments (and unlike gcse.c, doesn't use cautious heuristics to avoid
excessively increasing register pressure).
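
As a rough illustration of the global part, the sketch below propagates "which
value is known to be in FPMR" forward through an acyclic control-flow graph and
counts the writes that remain.  It is a simplified availability analysis under
stated assumptions, not the algorithm used by mode-switching.cc or gcse.c:
struct block, the value ids and count_fpmr_writes are hypothetical, blocks must
be listed after their predecessors, loops are not handled, and writes are only
placed at block entry rather than on edges.

```c
#include <assert.h>
#include <stddef.h>

#define UNKNOWN        (-1)   /* FPMR contents not known at this point */
#define NO_REQUIREMENT (-1)   /* block contains no FP8 operations */
#define MAX_PREDS      4

struct block {
    int required;             /* value id needed here, or NO_REQUIREMENT */
    int n_preds;
    int preds[MAX_PREDS];     /* predecessor indices, each less than own index */
};

/* Returns the number of FPMR writes still needed.  out[b] receives the value
   id known to be in FPMR at the end of block b (or UNKNOWN). */
size_t count_fpmr_writes(const struct block *blocks, size_t n, int *out)
{
    assert(blocks != NULL && out != NULL);

    size_t writes = 0;
    for (size_t b = 0; b < n; b++) {
        /* Meet over predecessors: the value is known on entry only if every
           predecessor leaves the same value behind. */
        int in = UNKNOWN;
        for (int p = 0; p < blocks[b].n_preds; p++) {
            int pred = blocks[b].preds[p];
            assert(pred >= 0 && (size_t)pred < b);
            if (p == 0)
                in = out[pred];
            else if (in != out[pred])
                in = UNKNOWN;
        }

        if (blocks[b].required != NO_REQUIREMENT) {
            if (in != blocks[b].required)
                writes++;               /* a write is needed at block entry */
            out[b] = blocks[b].required;
        } else {
            out[b] = in;                /* block leaves FPMR untouched */
        }
    }
    return writes;
}
```

On a diamond-shaped graph where the first and last blocks need the same value
and the two arms need nothing, only the first write survives.  If one arm needs
a different value, the join block can no longer rely on the incoming value and
has to write again.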

Initially the new pass would only apply to the AArch64 FPMR register, but in
future it could also be used for other hard registers with similar properties.

Does anyone have any comments on this approach, before I start writing any
code?

Thanks,
Andrew




Re: Proposed new pass to optimise mode register assignments

2024-09-08 Thread Andrew Carlotti via Gcc
On Sat, Sep 07, 2024 at 09:09:52AM +0200, Richard Biener wrote:
> 
> 
> > On 06.09.2024 at 17:38, Andrew Carlotti wrote:
> > 
> > Hi,
> > 
> > I'm working on optimising assignments to the AArch64 Floating-point Mode
> > Register (FPMR), as part of our FP8 enablement work.  Claudio has already
> > implemented FPMR as a hard register, with the intention that FP8 intrinsic
> > functions will compile to a combination of an fpmr register set, followed by an
> > FP8 operation that takes fpmr as an input operand.
> >
> > It would clearly be inefficient to retain an explicit FPMR assignment prior to
> > each FP8 instruction (especially in the common case where every assignment uses
> > the same FPMR value).  I think the best way to optimise this would be to
> > implement a new pass that can optimise assignments to individual hard registers.
> >
> > There are a number of existing passes that do similar optimisations, but which
> > I believe are unsuitable for this scenario for various reasons.  For example:
> >
> > - cse1 can already optimise FPMR assignments within an extended basic block,
> >  but can't handle broader optimisations.
> > - pre (in gcse.c) doesn't work with assigning constant values, which would miss
> >  many potential usages.  It also has limits on how far code can be moved,
> >  based around ideas of register pressure that don't apply to the context of a
> >  single hard register that shouldn't be used by the register allocator for
> >  anything else.  Additionally, it doesn't run at -Os.
> > - hoist (also using gcse.c) only handles constant values, and only runs when
> >  optimising for size.  It also has the rest of the issues that pre does.
> > - mode_sw only handles a small finite set of modes.  The mode requirements are
> >  determined solely by the instructions that require the specific mode, so mode
> >  switches don't depend on the output of previous instructions.
> >
> >
> > My intention would be for the new pass to reuse ideas, and hopefully some of
> > the existing code, from the mode-switching and gcse passes.  In particular,
> > gcse.c (or its dependencies) has code that could identify when values assigned
> > to the FPMR are known to be the same (although we may not need the full CSE
> > capabilities of gcse.c), and mode-switching.cc knows how to globally optimise
> > mode assignments (and unlike gcse.c, doesn't use cautious heuristics to avoid
> > excessively increasing register pressure).
> >
> > Initially the new pass would only apply to the AArch64 FPMR register, but in
> > future it could also be used for other hard registers with similar properties.
> > 
> > Does anyone have any comments on this approach, before I start writing any
> > code?
> 
> Can you explain in more detail why the mode-switching pass infrastructure
> isn’t a good fit?  ISTR it already is customizable via target hooks.
> 
> Richard 
> 

I forgot to explain how FPMR is used.

The FPMR register contains a large number of fields that control the data
formats and saturation/scaling behaviour used in various fp8 conversion and
multiplication intrinsics.  At present, I think there are 2^26 valid defined
values that can be used in the FPMR.  Furthermore, these values are not always
compile-time constants - we expect that developers will often reuse the same
compiled code (e.g. a matrix multiplication library routine) with different
formats or scaling/saturation behaviour selected at runtime (e.g. by passing a
parameter to the library routine).

(The specification for the FPMR register can be found at [1].  Its usage in
fp8 intrinsics is described in the draft ACLE spec at [2].)

As I understand it, the existing mode-switching pass infrastructure is built
around a small number of modes, where the choice of mode is a compile time
constant, and the total number of possible modes is fixed when building GCC.
Our usage of the FPMR register does not meet any of these criteria.  I don't
see how these limitations could be overcome with target hooks within the
constraints of the existing pass.



[1] 
https://developer.arm.com/documentation/ddi0601/2024-06/AArch64-Registers/FPMR--Floating-point-Mode-Register?lang=en

[2] 
https://github.com/ARM-software/acle/pull/323/files#diff-516526d4a18101dc85300bc2033d0f86dc46c505b7510a7694baabea851aedfaR5664
^ (Expand the large main/acle.md diff to see the relevant section)