Re: [2/2] RISC-V: Constant FP Optimization with 'Zfa'

Tsukasa OI via Gcc-patches Mon, 14 Aug 2023 20:39:13 -0700

On 2023/08/14 21:51, Jin Ma wrote:
> Hi Tsukasa,
>   What a coincidence, I also implemented zfa extension, which also includes 
> fli related instructions :)


Hi, I'm glad to know that someone is working on this extension more
comprehensively (especially when "someone" is an experienced GCC
contributor).  I prefer your patch set in general and glad to learn from
your patch set and your response that my approach was not *that* bad as
I expected.

When a new extension gets available, I will be more confident making a
patch set for GCC (as I already do in GNU Binutils).

> 
> links: https://gcc.gnu.org/pipermail/gcc-patches/2023-August/627294.html
> 
>> +  if (!TARGET_HARD_FLOAT || !TARGET_ZFA)
>> +    return result;
>> +  switch (GET_MODE (x))
>> +    {
>> +    case HFmode:
>> +      /* Not only 'Zfhmin', either 'Zfh' or 'Zvfh' is required.  */
>> +      if (!TARGET_ZFH && !TARGET_ZVFH)
> 
> When Zvfh means that zfh is also on, so there may be no need to judge
> the TARGET_ZVFH here. By the way,the format here seems wrong, maybe 'tab'
> is needed for alignment?

For indentation, I believe this is okay considering 3 indent (soft tab)
from the top (meaning 6 spaces).

For specification requirements, I think I'm correct.

The spec says that 'Zvfh' depends on 'Zve32f' and 'Zfhmin'.  'Zfhmin' is
a conversion-only 'Zfh' subset ('Zve32f' doesn't require any
FP16-related extensions).

Note that "fli.h" requires 'Zfa' and ('Zfh' and/or 'Zvfh').

So, 'Zfh' alone will not be sufficient to check requirements to the
"fli.h" instruction.  So, checking TARGET_ZFH || TARGET_ZVFH (for
existence of the "fli.h") should be correct and I think your patch needs
to be changed "in the long term".

"In the long term" means that, current GNU Binutils has a bug which
"fli.h" requires 'Zfa' and 'Zfh' ('Zfa' and 'Zvfh' does not work).
My initial 'Zfa' proposal (improved by Christoph Müllner and upstreamed
into master) intentionally ignored this case because I assumed that
approval/ratification of 'Zvfh' will take some time and we have a time
to fix before a release of Binutils following approval of both 'Zfa' and
'Zvfh' (it turned out to be wrong).

cf. <https://sourceware.org/pipermail/binutils/2023-August/129006.html>

So, "fixing" this part (on your patch) alone will not make the program
work (on the simulator) because current buggy GNU Binutils won't accept
it.  I'm working on it on the GNU Binutils side.

> 
>> +    return result;
>> +      break;
>> +    case SFmode: break;
>> +    case DFmode: break;
> 
> Maybe we still have to judge TARGET_DOUBLE_FLOAT?

Indeed.  I just missed that.

> 
>> +    default: return result;
>> +    }
>> +
>> +  if (!CONST_DOUBLE_P (x))
>> +    return result;
> 
> I think it might be better to judge whether x satisfies the CONST_DOUBLE_P
> before switch (GET_MODE (x)) above.

That's correct.  I think that's a part of leftover when I'm experimenting.

> 
>> +
>> +  r = *CONST_DOUBLE_REAL_VALUE (x);
>> +
>> +  if (REAL_VALUE_ISNAN (r))
>> +    {
>> +      long reprs[2] = { 0 };
>> +      /* Compare with canonical NaN.  */
>> +      switch (GET_MODE (x))
>> +    {
>> +    case HFmode:
>> +      reprs[0] = real_to_target (NULL, &r,
>> +                                 float_mode_for_size (16).require ());
>> +      /* 0x7e00: Canonical NaN for binary16.  */
>> +      if (reprs[0] != 0x7e00)
>> +        return result;
>> +      break;
>> +    case SFmode:
>> +      reprs[0] = real_to_target (NULL, &r,
>> +                                 float_mode_for_size (32).require ());
>> +      /* 0x7fc00000: Canonical NaN for binary32.  */
>> +      if (reprs[0] != 0x7fc00000)
>> +        return result;
>> +      break;
>> +    case DFmode:
>> +      real_to_target (reprs, &r, float_mode_for_size (64).require ());
>> +      if (FLOAT_WORDS_BIG_ENDIAN)
>> +        std::swap (reprs[0], reprs[1]);
>> +      /* 0x7ff80000_00000000: Canonical NaN for binary64.  */
>> +      if (reprs[0] != 0 || reprs[1] != 0x7ff80000)
>> +        return result;
>> +      break;
>> +    default:
>> +      gcc_unreachable ();
>> +    }
>> +      result.type = RISCV_FLOAT_CONST_NAN;
>> +      result.valid = true;
>> +      return result;
>> +    }
>> +  else if (REAL_VALUE_ISINF (r))
>> +    {
>> +      if (REAL_VALUE_NEGATIVE (r))
>> +    return result;
>> +      result.type = RISCV_FLOAT_CONST_INF;
>> +      result.valid = true;
>> +      return result;
>> +    }
>> +
>> +  bool sign = REAL_VALUE_NEGATIVE (r);
>> +  result.sign = sign;
>> +
>> +  r = real_value_abs (&r);
>> +  /* GCC internally does not use IEEE754-like encoding (where normalized
>> +     significands are in the range [1, 2).  GCC uses [0.5, 1) (see real.cc).
>> +     So, this exponent_p1 variable equals IEEE754 unbiased exponent + 1.  */
>> +  int exponent_p1 = REAL_EXP (&r);
>> +
>> +  /* For the mantissa, we expand into two HOST_WIDE_INTS, apart from the
>> +     highest (sign) bit, with a fixed binary point at bit point_pos.
>> +     m1 holds the low part of the mantissa, m2 the high part.
>> +     WARNING: If we ever have a representation using more than 2 * H_W_I - 1
>> +     bits for the mantissa, this can fail (low bits will be lost).  */
>> +  bool fail = false;
>> +  real_ldexp (&m, &r, (2 * HOST_BITS_PER_WIDE_INT - 1) - exponent_p1);
>> +  wide_int w = real_to_integer (&m, &fail, HOST_BITS_PER_WIDE_INT * 2);
>> +  if (fail)
>> +    return result;
>> +
>> +  /* If the low part of the mantissa has bits set we cannot represent
>> +     the value.  */
>> +  if (w.ulow () != 0)
>> +    return result;
>> +  /* We have rejected the lower HOST_WIDE_INT, so update our
>> +     understanding of how many bits lie in the mantissa and
>> +     look only at the high HOST_WIDE_INT.  */
>> +  unsigned HOST_WIDE_INT mantissa = w.elt (1);
>> +
>> +  /* We cannot represent the value 0.0.  */
>> +  if (mantissa == 0)
>> +    return result;
>> +
>> +  /* We can only represent values with a mantissa of the form 1.xx.  */
>> +  unsigned HOST_WIDE_INT mask
>> +      = ((unsigned HOST_WIDE_INT) 1 << (HOST_BITS_PER_WIDE_INT - 4)) - 1;
>> +  if ((mantissa & mask) != 0)
>> +    return result;
>> +  mantissa >>= HOST_BITS_PER_WIDE_INT - 4;
>> +  /* Now the lowest 3-bits of mantissa should form (1.xx)b.  */
>> +  gcc_assert (mantissa & (1u << 2));
>> +  /* Mask out the highest bit.  */
>> +  mantissa &= ~(1u << 2);
>> +
>> +  if (mantissa == 0)
>> +    {
>> +      /* We cannot represent any values but -1.0.  */
>> +      if (exponent_p1 != 1 && sign)
>> +    return result;
>> +      switch (exponent_p1)
>> +    {
>> +    case -15: /* 1.0 * 2^(-16)  */
>> +    case -14: /* 1.0 * 2^(-15)  */
>> +    case -7:  /* 1.0 * 2^(- 8)  */
>> +    case -6:  /* 1.0 * 2^(- 7)  */
>> +    case 8:   /* 1.0 * 2^(+ 7)  */
>> +    case 9:   /* 1.0 * 2^(+ 8)  */
>> +    case 16:  /* 1.0 * 2^(+15)  */
>> +    case 17:  /* 1.0 * 2^(+16)  */
>> +      break;
>> +    default:
>> +      if (exponent_p1 >= -3 && exponent_p1 <= 5)
>> +        /* 1.0 * 2^[-4,4]  */
>> +        break;
>> +      switch (GET_MODE (x))
>> +        {
>> +        case HFmode: /* IEEE 754 binary16.  */
>> +          /* Minimum positive normal == 1.0 * 2^(-14)  */
>> +          if (exponent_p1 != -13) return result;
>> +          break;
>> +        case SFmode: /* IEEE 754 binary32.  */
>> +          /* Minimum positive normal == 1.0 * 2^(-126)  */
>> +          if (exponent_p1 != -125) return result;
>> +          break;
>> +        case DFmode: /* IEEE 754 binary64.  */
>> +          /* Minimum positive normal == 1.0 * 2^(-1022)  */
>> +          if (exponent_p1 != -1021) return result;
>> +          break;
>> +        default:
>> +          gcc_unreachable ();
>> +        }
>> +      result.type = RISCV_FLOAT_CONST_MIN;
>> +      result.valid = true;
>> +      return result;
>> +    }
>> +    }
>> +  else
>> +    {
>> +      if (sign)
>> +    return result;
>> +      if (exponent_p1 < -1 || exponent_p1 > 2)
>> +    return result;
>> +      /* The value is (+1.xx)b * 2^[-2,1].
>> +     But we cannot represent (+1.11)b * 2^1 (that is 3.5). */
>> +      if (exponent_p1 == 2 && mantissa == 3)
>> +    return result;
>> +    }
>> +
>> +  result.valid = true;
>> +  result.mantissa_below_point = mantissa;
>> +  result.biased_exponent = exponent_p1 + 15;
>> +
>> +  return result;
>> +}
>> +
> 
> This code is great and completely different from the way I implemented it.
> I'm not sure which one is better, but my idea is that the fli instruction
> corresponds to three tables (HF, SF and DF), all of which represent
> specific values. the library in gcc's real.h can very well convert
> the corresponding values into the values in the table, so it is only
> necessary to perform a simple binary search to look up the tables.

Yup.  My approach (based on AArch64's VMOV.F32 constraint checking code)
is more generic but I think constants with single FLI instruction don't
need to be that generic.

If multi-instruction FLI sequence gets realistic, this kind of generic
approach (handling finite constants precisely) will be helpful (multi
FLI sequence with addition might need some additional measures to avoid
underflow, though).  But for now, I think your approach is better and
simpler.

> 
> @@ -1362,17 +1545,14 @@  riscv_const_insns (rtx x)
>                  constant incurs a literal-pool access.  Allow this in
>                  order to increase vectorization possibilities.  */
>               int n = riscv_const_insns (elt);
> -             if (CONST_DOUBLE_P (elt))
> -                 return 1 + 4; /* vfmv.v.f + memory access.  */
>> +            /* We need as many insns as it takes to load the constant
>> +               into a GPR and one vmv.v.x.  */
>> +            if (n != 0)
>> +              return 1 + n;
>> +            else if (CONST_DOUBLE_P (elt))
>> +              return 1 + 4; /* vfmv.v.f + memory access.  */
>               else
> -               {
> -                 /* We need as many insns as it takes to load the constant
> -                    into a GPR and one vmv.v.x.  */
> -                 if (n != 0)
> -                   return 1 + n;
> -                 else
> -                   return 1 + 4; /*vmv.v.x + memory access.  */
> -               }
>> +              return 1 + 4; /* vmv.v.x + memory access.  */
>             }
>         }
> 
> I don't seem to understand here, if n = = 0, always return 1 + 4?
> If so, it could be
> if (n != 0)
>    return 1 + n;
> else
>   return 1 + 4;
> 
> @@ -5117,6 +5313,36 @@  riscv_print_operand (FILE *file, rtx op, int letter)
>           output_address (mode, XEXP (op, 0));
>         break;
>  
>> +    case CONST_DOUBLE:
>> +      {
>> +        struct riscv_float_fli_const flt = riscv_get_float_fli_const (op);
>> +        if (flt.valid)
>> +          {
>> +            switch (flt.type)
>> +              {
>> +              case RISCV_FLOAT_CONST_MIN:
>> +                fputs ("min", file);
>> +                break;
>> +              case RISCV_FLOAT_CONST_INF:
>> +                fputs ("inf", file);
>> +                break;
>> +              case RISCV_FLOAT_CONST_NAN:
>> +                fputs ("nan", file);
>> +                break;
>> +              default:
>> +                /* Use simpler (and bit-perfect) printer.  */
>> +                if (flt.sign)
>> +                  fputc ('-', file);
>> +                fprintf (file, "0x1.%cp%+d",
>> +                         "048c"[flt.mantissa_below_point],
>> +                         (int) flt.biased_exponent - 16);
>> +                break;
>> +              }
>> +            break;
>> +          }
>> +      }
>> +      /* Fall through.  */
> 
> Display floating-point values at the assembly level can refer llvm
> https://reviews.llvm.org/D145645. 

Thanks for the link.  I personally prefer hexfloats to avoid precision
problems as possible and that's how GNU Binutils prints the FLI
constants.  But that makes sense (and I feel decimals are okay).

> 
> It may also be necessary to deal with riscv_split_64bit_move_p
> and riscv_legitimize_const_move for rv32, otherwise the mov of
> DFmode on rv32 will be split into high 32-bit mov and low 32-bit
> mov, thus unable to generate fli instructions.

Thanks for letting me know.  I'm fighting against GCC's large code base
for future contribution and that's a lot of help for me.

Thanks,
Tsukasa

> 
> Thanks,
> Jin
>

Re: [2/2] RISC-V: Constant FP Optimization with 'Zfa'

Reply via email to