Kyrylo Tkachov <ktkac...@nvidia.com> writes:
> Hi Richard,
>
>> On 22 Jan 2025, at 13:21, Richard Sandiford <richard.sandif...@arm.com> 
>> wrote:
>> 
>> GCC 15 is the first release to support FP8 intrinsics.
>> The underlying instructions depend on the value of a new register,
>> FPMR.  Unlike FPCR, FPMR is a normal call-clobbered/caller-save
>> register rather than a global register.  So:
>> 
>> - The FP8 intrinsics take a final uint64_t argument that
>>  specifies what value FPMR should have.
>> 
>> - If an FP8 operation is split across multiple functions,
>>  it is likely that those functions would have a similar argument.
>> 
>> If the object code has the structure:
>> 
>>    for (...)
>>      fp8_kernel (..., fpmr_value);
>> 
>> then fp8_kernel would set FPMR to fpmr_value each time it is
>> called, even though FPMR will already have that value for at
>> least the second and subsequent calls (and possibly the first).
>> 
>> The working assumption for the ABI has been that writes to
>> registers like FPMR can in general be more expensive than
>> reads and so it would be better to use a conditional write like:
>> 
>>       mrs     tmp, fpmr
>>       cmp     tmp, <value>
>>       beq     1f
>>       nsr     fpmr, <value>
>
> Typo “msr” here and in the comment in the code.

Oops, thanks, will fix.

> [...]
>> @@ -1883,6 +1884,44 @@ (define_split
>>   }
>> )
>> 
>> +;; The preferred way of writing to the FPMR is to test whether it already
>> +;; has the desired value and branch around the write if so.  This reduces
>> +;; the number of redundant FPMR writes caused by ABI boundaries, such as in:
>> +;;
>> +;;    for (...)
>> +;;      fp8_kernel (..., fpmr_value);
>> +;;
>> +;; Without this optimization, fp8_kernel would set FPMR to fpmr_value each
>> +;; time that it is called.
>> +;;
>> +;; We do this as a split so that hardreg_pre can optimize the moves first.
>> +(define_split
>> +  [(set (reg:DI FPM_REGNUM)
>> +        (match_operand:DI 0 "aarch64_reg_or_zero"))]
>> +  "TARGET_FP8 && !TARGET_CHEAP_FPMR_WRITE && can_create_pseudo_p ()"
>> +  [(const_int 0)]
>> +  {
>> +    auto label = gen_label_rtx ();
>> +    rtx current = copy_to_reg (gen_rtx_REG (DImode, FPM_REGNUM));
>> +    rtx cond = gen_rtx_EQ (VOIDmode, current, operands[0]);
>> +    emit_jump_insn (gen_cbranchdi4 (cond, current, operands[0], label));
>
> Do you think it’s worth marking this jump as likely?
> In some other expand code in the backend where we emit jumps we sometimes use 
> aarch64_emit_unlikely_jump.

Ah, yeah, I should have said that I'd wondered about that.  But in the
end it didn't seem appropriate.  Given that hardreg_pre should remove
local instances of redundancy, we don't really have much information
about whether the branch is likely or unlikely.  I think instead the
hope/expectation is that the branch has a predictable pattern.

Thanks,
Richard

Reply via email to