Kyrylo Tkachov <ktkac...@nvidia.com> writes: > Hi Richard, > >> On 22 Jan 2025, at 13:21, Richard Sandiford <richard.sandif...@arm.com> >> wrote: >> >> GCC 15 is the first release to support FP8 intrinsics. >> The underlying instructions depend on the value of a new register, >> FPMR. Unlike FPCR, FPMR is a normal call-clobbered/caller-save >> register rather than a global register. So: >> >> - The FP8 intrinsics take a final uint64_t argument that >> specifies what value FPMR should have. >> >> - If an FP8 operation is split across multiple functions, >> it is likely that those functions would have a similar argument. >> >> If the object code has the structure: >> >> for (...) >> fp8_kernel (..., fpmr_value); >> >> then fp8_kernel would set FPMR to fpmr_value each time it is >> called, even though FPMR will already have that value for at >> least the second and subsequent calls (and possibly the first). >> >> The working assumption for the ABI has been that writes to >> registers like FPMR can in general be more expensive than >> reads and so it would be better to use a conditional write like: >> >> mrs tmp, fpmr >> cmp tmp, <value> >> beq 1f >> nsr fpmr, <value> > > Typo “msr” here and in the comment in the code.
Oops, thanks, will fix. > [...] >> @@ -1883,6 +1884,44 @@ (define_split >> } >> ) >> >> +;; The preferred way of writing to the FPMR is to test whether it already >> +;; has the desired value and branch around the write if so. This reduces >> +;; the number of redundant FPMR writes caused by ABI boundaries, such as in: >> +;; >> +;; for (...) >> +;; fp8_kernel (..., fpmr_value); >> +;; >> +;; Without this optimization, fp8_kernel would set FPMR to fpmr_value each >> +;; time that it is called. >> +;; >> +;; We do this as a split so that hardreg_pre can optimize the moves first. >> +(define_split >> + [(set (reg:DI FPM_REGNUM) >> + (match_operand:DI 0 "aarch64_reg_or_zero"))] >> + "TARGET_FP8 && !TARGET_CHEAP_FPMR_WRITE && can_create_pseudo_p ()" >> + [(const_int 0)] >> + { >> + auto label = gen_label_rtx (); >> + rtx current = copy_to_reg (gen_rtx_REG (DImode, FPM_REGNUM)); >> + rtx cond = gen_rtx_EQ (VOIDmode, current, operands[0]); >> + emit_jump_insn (gen_cbranchdi4 (cond, current, operands[0], label)); > > Do you think it’s worth marking this jump as likely? > In some other expand code in the backend where we emit jumps we sometimes use > aarch64_emit_unlikely_jump. Ah, yeah, I should have said that I'd wondered about that. But in the end it didn't seem appropriate. Given that hardreg_pre should remove local instances of redundancy, we don't really have much information about whether the branch is likely or unlikely. I think instead the hope/expectation is that the branch has a predictable pattern. Thanks, Richard