On Sat, Sep 07, 2024 at 09:09:52AM +0200, Richard Biener wrote:
> 
> 
> > On 06.09.2024 at 17:38, Andrew Carlotti <andrew.carlo...@arm.com> wrote:
> > 
> > Hi,
> > 
> > I'm working on optimising assignments to the AArch64 Floating-point Mode
> > Register (FPMR), as part of our FP8 enablement work.  Claudio has already
> > implemented FPMR as a hard register, with the intention that FP8 intrinsic
> > functions will compile to a combination of an fpmr register set, followed
> > by an FP8 operation that takes fpmr as an input operand.
> > 
> > It would clearly be inefficient to retain an explicit FPMR assignment prior
> > to each FP8 instruction (especially in the common case where every
> > assignment uses the same FPMR value).  I think the best way to optimise this
> > would be to implement a new pass that can optimise assignments to individual
> > hard registers.
> > 
> > There are a number of existing passes that do similar optimisations, but
> > which I believe are unsuitable for this scenario for various reasons.  For
> > example:
> > 
> > - cse1 can already optimise FPMR assignments within an extended basic block,
> >  but can't handle broader optimisations.
> > - pre (in gcse.c) doesn't work with assigning constant values, which would
> >  miss many potential usages.  It also has limits on how far code can be
> >  moved, based around ideas of register pressure that don't apply to the
> >  context of a single hard register that shouldn't be used by the register
> >  allocator for anything else.  Additionally, it doesn't run at -Os.
> > - hoist (also using gcse.c) only handles constant values, and only runs when
> >  optimising for size.  It also has the rest of the issues that pre does.
> > - mode_sw only handles a small finite set of modes.  The mode requirements
> >  are determined solely by the instructions that require the specific mode,
> >  so mode switches don't depend on the output of previous instructions.
> > 
> > 
> > My intention would be for the new pass to reuse ideas, and hopefully some of
> > the existing code, from the mode-switching and gcse passes.  In particular,
> > gcse.c (or its dependencies) has code that could identify when values
> > assigned to the FPMR are known to be the same (although we may not need the
> > full CSE capabilities of gcse.c), and mode-switching.cc knows how to
> > globally optimise mode assignments (and unlike gcse.c, doesn't use cautious
> > heuristics to avoid excessively increasing register pressure).
> > 
> > Initially the new pass would only apply to the AArch64 FPMR register, but in
> > future it could also be used for other hard registers with similar
> > properties.
> > 
> > Does anyone have any comments on this approach, before I start writing any
> > code?
> 
> Can you explain in more detail why the mode-switching pass infrastructure
> isn’t a good fit?  ISTR it already is customizable via target hooks.
> 
> Richard 
> 

I forgot to explain how FPMR is used.

The FPMR register contains a large number of fields that control the data
formats and saturation/scaling behaviour used in various fp8 conversion and
multiplication intrinsics.  At present, I think there are 2^26 valid defined
values that can be used in the FPMR.  Furthermore, these values are not always
compile-time constants - we expect that developers will often reuse the same
compiled code (e.g. a matrix multiplication library routine) with different
formats or scaling/saturation behaviour selected at runtime (e.g. by passing a
parameter to the library routine).
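
To make the runtime case concrete, here is a toy Python model (not GCC code)
of a routine body where every FP8 operation is preceded by an FPMR set from
the same runtime value.  The instruction tuples, the name 'fmt', and the
assumption that a call may clobber FPMR are all invented for illustration.
The point is only that the redundant sets can be removed by comparing
symbolic values, without knowing the value at compile time.

```python
# Toy model only: instructions are (opcode, operand) tuples and FPMR values
# are symbolic names, not real RTL.
def drop_redundant_fpmr_sets(insns):
    """Remove ('set_fpmr', v) when FPMR is already known to hold v."""
    known = None              # symbolic value currently in FPMR; None = unknown
    out = []
    for insn in insns:
        op, arg = insn
        if op == "set_fpmr":
            if arg == known:
                continue      # redundant: FPMR already holds this value
            known = arg
        elif op == "call":
            known = None      # assumption: a call may change FPMR
        out.append(insn)
    return out

# 'fmt' stands for a value only known at runtime, e.g. a function parameter.
body = [
    ("set_fpmr", "fmt"), ("fp8_op", "a"),
    ("set_fpmr", "fmt"), ("fp8_op", "b"),
    ("set_fpmr", "fmt"), ("fp8_op", "c"),
]
optimised = drop_redundant_fpmr_sets(body)
```

Only the first set survives in 'optimised', which is roughly what cse1
already achieves within an extended basic block.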

(The specification for the FPMR register can be found at [1].  Its usage in
fp8 intrinsics is described in the draft ACLE spec at [2].)

As I understand it, the existing mode-switching pass infrastructure is built
around a small number of modes, where the choice of mode is a compile-time
constant, and the total number of possible modes is fixed when building GCC.
Our usage of the FPMR register does not meet any of these criteria.  I don't
see how these limitations could be overcome with target hooks within the
constraints of the existing pass.
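
As a rough illustration of the kind of global analysis I have in mind, here
is a toy Python sketch.  It is neither the mode-switching algorithm nor a
design for the new pass, and every name in it is hypothetical.  It tracks the
FPMR contents across a control flow graph as arbitrary symbolic values rather
than as indices into a fixed set of modes, and assumes that only an explicit
set changes FPMR.

```python
UNINIT = object()     # block not reached yet by the analysis
UNKNOWN = object()    # FPMR contents unknown, or different between paths

def meet(a, b):
    """Combine the FPMR values arriving from two predecessors."""
    if a is UNINIT:
        return b
    if b is UNINIT:
        return a
    return a if a == b else UNKNOWN

def transfer(value, insns):
    """FPMR value at the end of a block, given its value on entry."""
    for op, arg in insns:
        if op == "set_fpmr":      # assumption: nothing else writes FPMR
            value = arg
    return value

def fpmr_on_entry(blocks, preds, entry):
    """Iterate to a fixed point; return block name -> FPMR value on entry."""
    value_in = {b: UNINIT for b in blocks}
    value_out = {b: UNINIT for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            # FPMR is treated as unknown on entry to the function.
            new_in = UNKNOWN if b == entry else UNINIT
            for p in preds[b]:
                new_in = meet(new_in, value_out[p])
            new_out = transfer(new_in, blocks[b])
            if new_in != value_in[b] or new_out != value_out[b]:
                value_in[b], value_out[b] = new_in, new_out
                changed = True
    return value_in

# A preheader sets FPMR from the runtime value 'fmt'; the loop body sets it
# again before its FP8 operation.
blocks = {
    "pre":  [("set_fpmr", "fmt")],
    "loop": [("set_fpmr", "fmt"), ("fp8_op", "x")],
    "exit": [],
}
preds = {"pre": [], "loop": ["pre", "loop"], "exit": ["loop"]}
entry_values = fpmr_on_entry(blocks, preds, "pre")
```

Here entry_values["loop"] is "fmt", so the set inside the loop is redundant
even though 'fmt' is not a compile-time constant.  If two predecessors
disagree, the value on entry becomes UNKNOWN and the set has to stay.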



[1] https://developer.arm.com/documentation/ddi0601/2024-06/AArch64-Registers/FPMR--Floating-point-Mode-Register?lang=en

[2] https://github.com/ARM-software/acle/pull/323/files#diff-516526d4a18101dc85300bc2033d0f86dc46c505b7510a7694baabea851aedfaR5664
    ^ (Expand the large main/acle.md diff to see the relevant section)
