Jeff Law <jeffreya...@gmail.com> writes:
> On 9/7/24 1:09 AM, Richard Biener wrote:
>> 
>> 
>>> On 06.09.2024 at 17:38, Andrew Carlotti <andrew.carlo...@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> I'm working on optimising assignments to the AArch64 Floating-point Mode
>>> Register (FPMR), as part of our FP8 enablement work.  Claudio has already
>>> implemented FPMR as a hard register, with the intention that FP8 intrinsic
>>> functions will compile to a combination of an fpmr register set, followed
>>> by an FP8 operation that takes fpmr as an input operand.
>>>
>>> It would clearly be inefficient to retain an explicit FPMR assignment prior
>>> to each FP8 instruction (especially in the common case where every
>>> assignment uses the same FPMR value).  I think the best way to optimise this
>>> would be to implement a new pass that can optimise assignments to individual
>>> hard registers.
>>>
>>> There are a number of existing passes that do similar optimisations, but
>>> which I believe are unsuitable for this scenario for various reasons.  For
>>> example:
>>>
>>> - cse1 can already optimise FPMR assignments within an extended basic block,
>>>   but can't handle broader optimisations.
>>> - pre (in gcse.c) doesn't work with assigning constant values, which would
>>>   miss many potential usages.  It also has limits on how far code can be
>>>   moved, based around ideas of register pressure that don't apply to the
>>>   context of a single hard register that shouldn't be used by the register
>>>   allocator for anything else.  Additionally, it doesn't run at -Os.
>>> - hoist (also using gcse.c) only handles constant values, and only runs when
>>>   optimising for size.  It also has the rest of the issues that pre does.
>>> - mode_sw only handles a small finite set of modes.  The mode requirements
>>>   are determined solely by the instructions that require the specific mode,
>>>   so mode switches don't depend on the output of previous instructions.
>>>
>>>
>>> My intention would be for the new pass to reuse ideas, and hopefully some of
>>> the existing code, from the mode-switching and gcse passes.  In particular,
>>> gcse.c (or its dependencies) has code that could identify when values
>>> assigned to the FPMR are known to be the same (although we may not need the
>>> full CSE capabilities of gcse.c), and mode-switching.cc knows how to
>>> globally optimise mode assignments (and unlike gcse.c, doesn't use cautious
>>> heuristics to avoid excessively increasing register pressure).
>>>
>>> Initially the new pass would only apply to the AArch64 FPMR register, but in
>>> future it could also be used for other hard registers with similar
>>> properties.
>>>
>>> Does anyone have any comments on this approach, before I start writing any
>>> code?
>> 
>> Can you explain in more detail why the mode-switching pass
>> infrastructure isn’t a good fit?  ISTR it already is customizable via
>> target hooks.
>
> Agreed.  Mode switching seems to be the right pass to look at.
>
> It probably is worth pointing out that mode switching is LCM based and 
> as such never speculates.  Given the potential cost of a mode switch, 
> failure to speculate may be a notable limitation (though the same would 
> apply to the ideas Andrew floated above).
>
> This has recently come up in the RISC-V space due to needing VXRM 
> assignments so that we can utilize the vaaddu add-with-averaging 
> instructions.    Placement of VXRM mode switches looks optimal from an 
> LCM standpoint, but speculation can measurably improve performance.  It 
> was something like 2% on the BPI for x264.  The k1/m1 chip in the BPI is 
> almost certainly flushing its pipelines on the VXRM assignment.

Ah yeah, good point.  I expect speculation would be best for FPMR as well.
I imagine most use cases will be well-structured in practice, but for
those that aren't...
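
To make the speculation trade-off concrete, here is a toy model (not GCC
code; the function names and the loop shape are assumptions made up for
illustration).  It assumes a loop in which only a conditional block needs
the mode, and nothing else in the loop changes it.  A placement that never
speculates keeps the set in the conditional block, so it executes every
time that block runs; a speculative placement hoists a single set to the
loop preheader, at the cost of one set even when no iteration needs it.

```python
def mode_sets_without_speculation(needs_mode):
    """Mode-register writes when the set stays in the conditional block.

    needs_mode holds one bool per loop iteration: True if that iteration
    executes the block requiring the mode.  (Simplified model, assumed
    for illustration only.)
    """
    return sum(1 for needed in needs_mode if needed)


def mode_sets_with_speculation(needs_mode):
    """Mode-register writes when one set is hoisted to the loop preheader.

    The hoisted set runs exactly once, even if no iteration needs it.
    """
    return 1


if __name__ == "__main__":
    # Hypothetical trace: every fourth iteration of 1000 needs the mode.
    trace = [i % 4 == 0 for i in range(1000)]
    print("no speculation:", mode_sets_without_speculation(trace))
    print("speculation:   ", mode_sets_with_speculation(trace))
```

If each write is expensive (a pipeline flush, say), the hoisted version
wins whenever the conditional block runs more than once, and loses only
when it never runs at all.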

> I've got a hack here that I'll submit upstream at some point.  Just not 
> at the top of my list yet -- especially now that our uarch has been 
> fixed to not flush its pipelines at VXRM assignments ;-)

Is that handled by mode-switching, or is it a separate thing?

Richard