On Sat, Sep 07, 2024 at 09:09:52AM +0200, Richard Biener wrote:
>
> > On 06.09.2024 at 17:38, Andrew Carlotti <andrew.carlo...@arm.com> wrote:
> >
> > Hi,
> >
> > I'm working on optimising assignments to the AArch64 Floating-point Mode
> > Register (FPMR), as part of our FP8 enablement work.  Claudio has already
> > implemented FPMR as a hard register, with the intention that FP8 intrinsic
> > functions will compile to a combination of an fpmr register set, followed
> > by an FP8 operation that takes fpmr as an input operand.
> >
> > It would clearly be inefficient to retain an explicit FPMR assignment
> > prior to each FP8 instruction (especially in the common case where every
> > assignment uses the same FPMR value).  I think the best way to optimise
> > this would be to implement a new pass that can optimise assignments to
> > individual hard registers.
> >
> > There are a number of existing passes that do similar optimisations, but
> > which I believe are unsuitable for this scenario for various reasons.  For
> > example:
> >
> > - cse1 can already optimise FPMR assignments within an extended basic
> >   block, but can't handle broader optimisations.
> > - pre (in gcse.c) doesn't work with assigning constant values, which would
> >   miss many potential usages.  It also has limits on how far code can be
> >   moved, based around ideas of register pressure that don't apply to the
> >   context of a single hard register that shouldn't be used by the register
> >   allocator for anything else.  Additionally, it doesn't run at -Os.
> > - hoist (also using gcse.c) only handles constant values, and only runs
> >   when optimising for size.  It also has the rest of the issues that pre
> >   does.
> > - mode_sw only handles a small finite set of modes.  The mode requirements
> >   are determined solely by the instructions that require the specific
> >   mode, so mode switches don't depend on the output of previous
> >   instructions.
> >
> > My intention would be for the new pass to reuse ideas, and hopefully some
> > of the existing code, from the mode-switching and gcse passes.  In
> > particular, gcse.c (or its dependencies) has code that could identify when
> > values assigned to the FPMR are known to be the same (although we may not
> > need the full CSE capabilities of gcse.c), and mode-switching.cc knows how
> > to globally optimise mode assignments (and unlike gcse.c, doesn't use
> > cautious heuristics to avoid excessively increasing register pressure).
> >
> > Initially the new pass would only apply to the AArch64 FPMR register, but
> > in future it could also be used for other hard registers with similar
> > properties.
> >
> > Does anyone have any comments on this approach, before I start writing any
> > code?
>
> Can you explain in more detail why the mode-switching pass infrastructure
> isn’t a good fit?  ISTR it already is customizable via target hooks.
>
> Richard

I forgot to explain how FPMR is used.

The FPMR register contains a large number of fields that control the data
formats and saturation/scaling behaviour used in various fp8 conversion and
multiplication intrinsics.  At present, I think there are 2^26 valid defined
values that can be used in the FPMR.  Furthermore, these values are not
always compile-time constants - we expect that developers will often reuse
the same compiled code (e.g. a matrix multiplication library routine) with
different formats or scaling/saturation behaviour selected at runtime (e.g.
by passing a parameter to the library routine).

(The specification for the FPMR register can be found at [1].  Its usage in
fp8 intrinsics is described in the draft ACLE spec at [2].)

As I understand it, the existing mode-switching pass infrastructure is built
around a small number of modes, where the choice of mode is a compile-time
constant, and the total number of possible modes is fixed when building GCC.
Our usage of the FPMR register does not meet any of these criteria.  I don't
see how these limitations could be overcome with target hooks within the
constraints of the existing pass.

[1] https://developer.arm.com/documentation/ddi0601/2024-06/AArch64-Registers/FPMR--Floating-point-Mode-Register?lang=en
[2] https://github.com/ARM-software/acle/pull/323/files#diff-516526d4a18101dc85300bc2033d0f86dc46c505b7510a7694baabea851aedfaR5664
    ^ (Expand the large main/acle.md diff to see the relevant section)
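
To make the runtime-selected case above more concrete, here is a rough
sketch in plain C.  Everything in it is a hypothetical stand-in rather than
the ACLE intrinsics or anything in GCC: fpmr_model, set_fpmr and fp8_op just
model a register write and an operation that reads that register implicitly.
The point is only that "mode" is an ordinary runtime parameter, so it can't
be one of a fixed set of modes, yet the per-operation writes are still
redundant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the FPMR: a global value plus a write counter. */
static uint64_t fpmr_model;
static unsigned fpmr_writes;

/* Hypothetical stand-in for an FPMR register set. */
static void set_fpmr(uint64_t value)
{
    fpmr_model = value;
    fpmr_writes++;
}

/* Hypothetical stand-in for an FP8 operation that takes the FPMR as an
   implicit input operand.  The XOR is arbitrary; it only makes the result
   depend on the modelled register.  */
static uint64_t fp8_op(uint8_t x)
{
    return (uint64_t)x ^ fpmr_model;
}

/* Naive lowering: an FPMR write precedes every FP8 operation.
   Returns the number of FPMR writes performed.  */
unsigned run_naive(const uint8_t *in, uint64_t *out, size_t n, uint64_t mode)
{
    assert(in != NULL && out != NULL);
    fpmr_writes = 0;
    for (size_t i = 0; i < n; i++) {
        set_fpmr(mode);          /* same runtime value on every iteration */
        out[i] = fp8_op(in[i]);
    }
    return fpmr_writes;
}

/* What we would like the compiler to produce: a single write of the
   runtime value, placed before the loop.  */
unsigned run_hoisted(const uint8_t *in, uint64_t *out, size_t n, uint64_t mode)
{
    assert(in != NULL && out != NULL);
    fpmr_writes = 0;
    set_fpmr(mode);
    for (size_t i = 0; i < n; i++)
        out[i] = fp8_op(in[i]);
    return fpmr_writes;
}
```

Calling both routines on the same input with the same mode value gives
identical outputs; only the number of modelled FPMR writes differs (one per
element versus one in total).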