Thanks for the review. I'll get started on it, but one question...

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Monday, September 30, 2024 6:33 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; Marcus Shawcroft
> <marcus.shawcr...@arm.com>; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH 2/2]AArch64: support encoding integer immediates using
> floating point moves
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi All,
> >
> > This patch extends our immediate SIMD generation cases to support
> > generating integer immediates using floating point operations if the
> > integer immediate maps to an exact FP value.
> >
> > As an example:
> >
> > uint32x4_t f1() {
> >   return vdupq_n_u32(0x3f800000);
> > }
> >
> > currently generates:
> >
> > f1:
> >         adrp    x0, .LC0
> >         ldr     q0, [x0, #:lo12:.LC0]
> >         ret
> >
> > i.e. a load, but with this change:
> >
> > f1:
> >         fmov    v0.4s, 1.0e+0
> >         ret
> >
> > Such immediates are common in e.g. our math routines in glibc because
> > they are used as masks to extract or mark parts of an FP value.
>
> I agree this is a good thing to do.  The current code is too beholden
> to the original vector mode.  This patch relaxes it so that it isn't
> beholden to the original mode's class (integer vs. float), but it would
> still be beholden to the original mode's element size.
>
> It looks like an alternative would be to remove:
>
>   scalar_float_mode elt_float_mode;
>   if (n_elts == 1
>       && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
>     {
>       rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
>       if (aarch64_float_const_zero_rtx_p (elt)
>           || aarch64_float_const_representable_p (elt))
>         {
>           if (info)
>             *info = simd_immediate_info (elt_float_mode, elt);
>           return true;
>         }
>     }
>
> and instead insert code:
>
>   /* Get the repeating 8-byte value as an integer.  No endian correction
>      is needed here because bytes is already in lsb-first order.  */
>   unsigned HOST_WIDE_INT val64 = 0;
>   for (unsigned int i = 0; i < 8; i++)
>     val64 |= ((unsigned HOST_WIDE_INT) bytes[i % nbytes]
>               << (i * BITS_PER_UNIT));
>
> ---> here
>
>   if (vec_flags & VEC_SVE_DATA)
>     return aarch64_sve_valid_immediate (val64, info);
>   else
>     return aarch64_advsimd_valid_immediate (val64, info, which);
>
> that tries to reduce val64 to the smallest repeating pattern,
> then tries to interpret that pattern as a float.  The reduction step
> could reuse the first part of aarch64_sve_valid_immediate, which
> calculates the narrowest repeating integer mode:
>
>   scalar_int_mode mode = DImode;
>   unsigned int val32 = val64 & 0xffffffff;
>   if (val32 == (val64 >> 32))
>     {
>       mode = SImode;
>       unsigned int val16 = val32 & 0xffff;
>       if (val16 == (val32 >> 16))
>         {
>           mode = HImode;
>           unsigned int val8 = val16 & 0xff;
>           if (val8 == (val16 >> 8))
>             mode = QImode;
>         }
>     }
>
> This would give us the candidate integer mode, to which we could
> apply float_mode_for_size (...).exists, as in the patch.
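If I've followed correctly, the combined step would look roughly like this
(an untested sketch on my side; the helper name aarch64_fp_pattern_for_val64
and its out-parameters are made up for illustration, not part of the patch):

  /* Reduce VAL64 to its narrowest repeating integer mode (reusing the
     logic from aarch64_sve_valid_immediate) and try to view the
     repeating pattern as a float of the same width.  */
  static bool
  aarch64_fp_pattern_for_val64 (unsigned HOST_WIDE_INT val64,
                                REAL_VALUE_TYPE *r_out,
                                scalar_float_mode *fmode_out)
  {
    scalar_int_mode int_mode = DImode;
    unsigned int val32 = val64 & 0xffffffff;
    if (val32 == (val64 >> 32))
      {
        int_mode = SImode;
        unsigned int val16 = val32 & 0xffff;
        if (val16 == (val32 >> 16))
          {
            int_mode = HImode;
            unsigned int val8 = val16 & 0xff;
            if (val8 == (val16 >> 8))
              int_mode = QImode;
          }
      }

    /* There is no 8-bit float mode, so a byte-repeating pattern simply
       fails the lookup (MOVI handles that case anyway).  */
    if (!float_mode_for_size (GET_MODE_BITSIZE (int_mode)).exists (fmode_out))
      return false;

    /* real_from_target wants the bits split into 32-bit chunks,
       least significant first.  */
    long words[2] = { (long) (val64 & 0xffffffff),
                      (long) ((val64 >> 32) & 0xffffffff) };
    real_from_target (r_out, words, *fmode_out);
    return true;
  }

with the caller then handing *r_out and *fmode_out to the split-out
REAL_VALUE_TYPE variant of aarch64_float_const_representable_p.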
I was unsure whether float_mode_for_size is safe to use here, which is why
I listed the modes explicitly in a switch statement.  Specifically, I'm
concerned about there being more than one 16-bit float format.  Looking at
the source of float_mode_for_size, it just returns the first float mode of
the requested size, which makes it sensitive to the order of definition in
aarch64/aarch64-modes.def.  Is it safe to assume that storage-only formats
like BF16 will always be listed after the general compute types?
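If that ordering isn't guaranteed, I could keep a guard along these lines
instead (again only a sketch, untested; rejecting the bfloat format by name
is just one way to express it, and int_mode is the narrowed mode from the
sketch above):

  /* Only accept the float mode if FMOV can actually operate on its
     format; reject the storage-only bfloat16 format.  */
  scalar_float_mode fmode;
  if (float_mode_for_size (GET_MODE_BITSIZE (int_mode)).exists (&fmode)
      && REAL_MODE_FORMAT (fmode) != &arm_bfloat_half_format)
    {
      /* FMODE is a computational format of the right width.  */
    }

That would keep the result independent of the definition order, at the cost
of naming the format explicitly.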
Thanks,
Tamar

> In this case we would have the value as an integer, rather than
> as an rtx, so I think it would make sense to split out the part of
> aarch64_float_const_representable_p that processes the REAL_VALUE_TYPE.
> aarch64_simd_valid_immediate could then use the patch's:
>
> > +      long int as_long_ints[2];
> > +      as_long_ints[0] = buf & 0xFFFFFFFF;
> > +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> [...]
> > +      real_from_target (&r, as_long_ints, fmode);
>
> with "buf" being "val64" in the code above, and "fmode" being the result
> of float_mode_for_size (...).exists.  aarch64_simd_valid_immediate
> would then pass "r" and "fmode" to the new, split-out variant of
> aarch64_float_const_representable_p.  (I haven't checked the endianness
> requirements for real_from_target.)
>
> The split-out variant would still perform the HFmode test in:
>
>   if (GET_MODE (x) == VOIDmode
>       || (GET_MODE (x) == HFmode && !TARGET_FP_F16INST))
>     return false;
>
> The VOIDmode test is redundant and can be dropped.  AArch64 has always
> been a CONST_WIDE_INT target.
>
> If we do that, we should probably also pass the integer mode calculated
> by the code quoted above down to aarch64_sve_valid_immediate (where it
> came from) and aarch64_advsimd_valid_immediate, since both of them would
> find it useful.  E.g.:
>
>   /* Try using a replicated byte.  */
>   if (which == AARCH64_CHECK_MOV
>       && val16 == (val32 >> 16)
>       && val8 == (val16 >> 8))
>     {
>       if (info)
>         *info = simd_immediate_info (QImode, val8);
>       return true;
>     }
>
> would become:
>
>   /* Try using a replicated byte.  */
>   if (which == AARCH64_CHECK_MOV && mode == QImode)
>     {
>       if (info)
>         *info = simd_immediate_info (QImode, val8);
>       return true;
>     }
>
> I realise that's quite a bit different from the patch as posted, sorry,
> and I've made it sound more complicated than it actually is.  But I think
> it should be both more general (because it ignores the element size as
> well as the mode class) and a little simpler.
>
> The proposed split of aarch64_float_const_representable_p would be
> a replacement for patch 1 in the series.  The current rtx version
> of aarch64_float_const_representable_p would not need to take a mode,
> but the REAL_VALUE_TYPE interface would.
>
> Thanks,
> Richard
>
> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     * config/aarch64/aarch64-protos.h (aarch64_float_const_representable_p):
> >     Add overload.
> >     * config/aarch64/aarch64.cc (aarch64_float_const_zero_rtx_p): Reject
> >     integer modes.
> >     (aarch64_simd_valid_immediate, aarch64_float_const_representable_p):
> >     Check if integer value maps to an exact FP constant.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     * gcc.target/aarch64/const_create_using_fmov.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index 7a84acc59569da0b50af2300615db561a5de460a..6c683ea2d93e1b733cfe49fac38381ea6451fd55 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -974,6 +974,7 @@ void aarch64_split_simd_move (rtx, rtx);
> >
> >  /* Check for a legitimate floating point constant for FMOV.  */
> >  bool aarch64_float_const_representable_p (rtx, machine_mode);
> > +bool aarch64_float_const_representable_p (rtx *, rtx, machine_mode);
> >
> >  extern int aarch64_epilogue_uses (int);
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 1842f6ecf6330f11a64545d0903240c89b104ffc..2d44608d93b8e7542ea8d5eb4c3f99c9f88e70ed 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -10991,7 +10991,8 @@ aarch64_float_const_zero_rtx_p (rtx x)
> >    /* 0.0 in Decimal Floating Point cannot be represented by #0 or
> >       zr as our callers expect, so no need to check the actual
> >       value if X is of Decimal Floating Point type.  */
> > -  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT)
> > +  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT
> > +      || !CONST_DOUBLE_P (x))
> >      return false;
> >
> >    if (REAL_VALUE_MINUS_ZERO (*CONST_DOUBLE_REAL_VALUE (x)))
> > @@ -23026,17 +23027,30 @@ aarch64_simd_valid_immediate (rtx op, simd_immediate_info *info,
> >    else
> >      return false;
> >
> > -  scalar_float_mode elt_float_mode;
> > -  if (n_elts == 1
> > -      && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
> > +  if (n_elts == 1)
> >      {
> >        rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
> > +      rtx new_elt = NULL_RTX;
> >        if (aarch64_float_const_zero_rtx_p (elt)
> > -          || aarch64_float_const_representable_p (elt, elt_mode))
> > -        {
> > -          if (info)
> > -            *info = simd_immediate_info (elt_float_mode, elt);
> > -          return true;
> > +          || aarch64_float_const_representable_p (&new_elt, elt, elt_mode))
> > +        {
> > +          scalar_float_mode elt_float_mode;
> > +          auto bitsize = GET_MODE_UNIT_BITSIZE (elt_mode);
> > +          if (is_a <scalar_float_mode> (elt_mode))
> > +            elt_float_mode = as_a <scalar_float_mode> (elt_mode);
> > +          else if (which == AARCH64_CHECK_MOV
> > +                   && new_elt
> > +                   && float_mode_for_size (bitsize).exists (&elt_float_mode))
> > +            elt = new_elt;
> > +          else
> > +            elt = NULL_RTX;
> > +
> > +          if (elt != NULL_RTX)
> > +            {
> > +              if (info)
> > +                *info = simd_immediate_info (elt_float_mode, elt);
> > +              return true;
> > +            }
> >          }
> >      }
> >
> > @@ -25121,8 +25135,22 @@ aarch64_c_mode_for_suffix (char suffix)
> >
> >  /* Return true iff X with mode MODE can be represented by a quarter-precision
> >     floating point immediate operand X.  Note, we cannot represent 0.0.  */
> > +
> >  bool
> >  aarch64_float_const_representable_p (rtx x, machine_mode mode)
> > +{
> > +  return aarch64_float_const_representable_p (NULL, x, mode);
> > +}
> > +
> > +
> > +/* Return true iff X with mode MODE can be represented by a quarter-precision
> > +   floating point immediate operand X.  Note, we cannot represent 0.0.
> > +   If the value is a CONST_INT that can be represented as an exact floating
> > +   point then OUT will contain the new floating point value to emit to
> > +   generate the integer constant.  */
> > +
> > +bool
> > +aarch64_float_const_representable_p (rtx *out, rtx x, machine_mode mode)
> >  {
> >    /* This represents our current view of how many bits
> >       make up the mantissa.  */
> > @@ -25134,14 +25162,45 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
> >
> >    x = unwrap_const_vec_duplicate (x);
> >    mode = GET_MODE_INNER (mode);
> > -  if (!CONST_DOUBLE_P (x))
> > +  if (!CONST_DOUBLE_P (x)
> > +      && !CONST_INT_P (x))
> >      return false;
> >
> >    if (mode == VOIDmode
> > -      || (mode == HFmode && !TARGET_FP_F16INST))
> > +      || ((mode == HFmode || mode == HImode) && !TARGET_FP_F16INST))
> >      return false;
> >
> > -  r = *CONST_DOUBLE_REAL_VALUE (x);
> > +  /* If we have an integer bit pattern, decode it back into a real.
> > +     real_from_target requires the representation to be split into
> > +     32-bit values and then put into two host wide ints.  */
> > +  if (CONST_INT_P (x))
> > +    {
> > +      HOST_WIDE_INT buf = INTVAL (x);
> > +      long int as_long_ints[2];
> > +      as_long_ints[0] = buf & 0xFFFFFFFF;
> > +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> > +      machine_mode fmode;
> > +      switch (mode)
> > +        {
> > +        case HImode:
> > +          fmode = HFmode;
> > +          break;
> > +        case SImode:
> > +          fmode = SFmode;
> > +          break;
> > +        case DImode:
> > +          fmode = DFmode;
> > +          break;
> > +        default:
> > +          return false;
> > +        }
> > +
> > +      real_from_target (&r, as_long_ints, fmode);
> > +      if (out)
> > +        *out = const_double_from_real_value (r, fmode);
> > +    }
> > +  else
> > +    r = *CONST_DOUBLE_REAL_VALUE (x);
> >
> >    /* We cannot represent infinities, NaNs or +/-zero.  We won't
> >       know if we have +zero until we analyse the mantissa, but we
> > @@ -25170,6 +25229,7 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
> >       the value.  */
> >    if (w.ulow () != 0)
> >      return false;
> > +
> >    /* We have rejected the lower HOST_WIDE_INT, so update our
> >       understanding of how many bits lie in the mantissa and
> >       look only at the high HOST_WIDE_INT.  */
> > @@ -25205,9 +25265,9 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
> >    return (exponent >= 0 && exponent <= 7);
> >  }
> >
> > -/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR or BIC
> > -   immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether to
> > -   output MOVI/MVNI, ORR or BIC immediate. */
> > +/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR, BIC or
> > +   FMOV immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether
> > +   to output MOVI/MVNI, ORR or BIC immediate.  */
> >  char*
> >  aarch64_output_simd_mov_immediate (rtx const_vector, unsigned width,
> >                                     enum simd_immediate_check which)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..e080afed8aa3578660027979335bfc859ca6bc91
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> > @@ -0,0 +1,87 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-march=armv9-a -Ofast" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/*
> > +** g:
> > +**	fmov	v0\.4s, 1\.0e\+0
> > +**	ret
> > +*/
> > +float32x4_t g(){
> > +    return vdupq_n_f32(1);
> > +}
> > +
> > +/*
> > +** h:
> > +**	fmov	v0\.4s, 1\.0e\+0
> > +**	ret
> > +*/
> > +uint32x4_t h() {
> > +    return vreinterpretq_u32_f32(g());
> > +}
> > +
> > +/*
> > +** f1:
> > +**	fmov	v0\.4s, 1\.0e\+0
> > +**	ret
> > +*/
> > +uint32x4_t f1() {
> > +    return vdupq_n_u32(0x3f800000);
> > +}
> > +
> > +/*
> > +** f2:
> > +**	fmov	v0\.4s, 1\.5e\+0
> > +**	ret
> > +*/
> > +uint32x4_t f2() {
> > +    return vdupq_n_u32(0x3FC00000);
> > +}
> > +
> > +/*
> > +** f3:
> > +**	fmov	v0\.4s, 1\.25e\+0
> > +**	ret
> > +*/
> > +uint32x4_t f3() {
> > +    return vdupq_n_u32(0x3FA00000);
> > +}
> > +
> > +/*
> > +** f4:
> > +**	fmov	v0\.2d, 1\.0e\+0
> > +**	ret
> > +*/
> > +uint64x2_t f4() {
> > +    return vdupq_n_u64(0x3FF0000000000000);
> > +}
> > +
> > +/*
> > +** fn4:
> > +**	fmov	v0\.2d, -1\.0e\+0
> > +**	ret
> > +*/
> > +uint64x2_t fn4() {
> > +    return vdupq_n_u64(0xBFF0000000000000);
> > +}
> > +
> > +/*
> > +** f5:
> > +**	fmov	v0\.8h, 1\.5e\+0
> > +**	ret
> > +*/
> > +uint16x8_t f5() {
> > +    return vdupq_n_u16(0x3E00);
> > +}
> > +
> > +/*
> > +** f6:
> > +**	adrp	x0, \.LC0
> > +**	ldr	q0, \[x0, #:lo12:\.LC0\]
> > +**	ret
> > +*/
> > +uint32x4_t f6() {
> > +    return vdupq_n_u32(0x4f800000);
> > +}
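As an aside, the constants in the tests are just the IEEE-754 bit patterns
of the FP values being materialised.  A quick host-side check (plain C, not
part of the patch) shows the correspondence:

  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    /* 0x3f800000 is the single-precision pattern for 1.0, which is why
       f1 above can become "fmov v0.4s, 1.0e+0".  */
    unsigned int bits = 0x3f800000;
    float f;
    memcpy (&f, &bits, sizeof f);
    printf ("0x%08x -> %g\n", bits, (double) f);
    return 0;
  }

f6 is the negative test: 0x4f800000 decodes to 2^32, which is outside the
roughly 0.125 to 31 magnitude range an 8-bit FMOV immediate can encode, so
it must still be loaded from the constant pool.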