Re: [GCC][PATCH][Aarch64] Add Bfloat16_t scalar type, vector types and machine modes to Aarch64 back-end [1/2]

Richard Sandiford Mon, 23 Dec 2019 08:57:39 -0800

Stam Markianos-Wright <stam.markianos-wri...@arm.com> writes:
> On 12/19/19 10:01 AM, Richard Sandiford wrote:
>>> +
>>> +#pragma GCC push_options
>>> +#pragma GCC target ("arch=armv8.2-a+bf16")
>>> +#ifdef __ARM_FEATURE_BF16_SCALAR_ARITHMETIC
>>> +
>>> +typedef __bf16 bfloat16_t;
>>> +
>>> +
>>> +#endif
>>> +#pragma GCC pop_options
>>> +
>>> +#endif
>> 
>> Are you sure we need the #ifdef?  The target pragma should guarantee
>> that the macro's defined.
>> 
>> But the validity of the typedef shouldn't depend on target options,
>> so AFAICT this should just be:
>> 
>> typedef __bf16 bfloat16_t;
>
> Ok so it's a case of "what do we want to happen if the user tries to use 
> bfloats 
> without +bf16 enabled.
>
> So the intent of the ifdef was to not have bfloat16_t be visible if the macro 
> wasn't defined (i.e. not having any bf16 support), but I see now that this 
> was 
> being negated by the target macro, anyway! Oops, my bad for not really 
> understanding that, sorry!
>
> If we have the types always visible, then the user may use them, resulting in 
> an 
> ICE.
>
> But even if the #ifdef worked this still doesn't stop the user from trying to 
> use  __bf16 or __Bfloat16x4_t, __Bfloat16x8_t , which would still do produce 
> an 
> ICE, so it's not a perfect solution anyway...


Right.  Or they could use #pragma GCC target to switch to a different
non-bf16 target after including arm_bf16.h.

> One other thing I tried was the below change to aarch64-builtins.c which 
> stops 
> __bf16 or the vector types from being registered at all:
>
> --- a/gcc/config/aarch64/aarch64-builtins.c
> +++ b/gcc/config/aarch64/aarch64-builtins.c
> @@ -759,26 +759,32 @@ aarch64_init_simd_builtin_types (void)
>      aarch64_simd_types[Float64x1_t].eltype = double_type_node;
>      aarch64_simd_types[Float64x2_t].eltype = double_type_node;
>
> -  /* Init Bfloat vector types with underlying __bf16 type.  */
> -  aarch64_simd_types[Bfloat16x4_t].eltype = aarch64_bf16_type_node;
> -  aarch64_simd_types[Bfloat16x8_t].eltype = aarch64_bf16_type_node;
> +  if (TARGET_BF16_SIMD)
> +    {
> +      /* Init Bfloat vector types with underlying __bf16 type.  */
> +      aarch64_simd_types[Bfloat16x4_t].eltype = aarch64_bf16_type_node;
> +      aarch64_simd_types[Bfloat16x8_t].eltype = aarch64_bf16_type_node;
> +    }
>
>      for (i = 0; i < nelts; i++)
>        {
>          tree eltype = aarch64_simd_types[i].eltype;
>          machine_mode mode = aarch64_simd_types[i].mode;
>
> -      if (aarch64_simd_types[i].itype == NULL)
> +      if (eltype != NULL)
>           {
> -         aarch64_simd_types[i].itype
> -           = build_distinct_type_copy
> -             (build_vector_type (eltype, GET_MODE_NUNITS (mode)));
> -         SET_TYPE_STRUCTURAL_EQUALITY (aarch64_simd_types[i].itype);
> -       }
> +         if (aarch64_simd_types[i].itype == NULL)
> +           {
> +             aarch64_simd_types[i].itype
> +               = build_distinct_type_copy
> +               (build_vector_type (eltype, GET_MODE_NUNITS (mode)));
> +             SET_TYPE_STRUCTURAL_EQUALITY (aarch64_simd_types[i].itype);
> +           }
>
> -      tdecl = add_builtin_type (aarch64_simd_types[i].name,
> -                               aarch64_simd_types[i].itype);
> -      TYPE_NAME (aarch64_simd_types[i].itype) = tdecl;
> +         tdecl = add_builtin_type (aarch64_simd_types[i].name,
> +                                   aarch64_simd_types[i].itype);
> +         TYPE_NAME (aarch64_simd_types[i].itype) = tdecl;
> +       }
>        }
>
>    #define AARCH64_BUILD_SIGNED_TYPE(mode)  \
> @@ -1240,7 +1246,8 @@ aarch64_general_init_builtins (void)
>
>      aarch64_init_fp16_types ();
>
> -  aarch64_init_bf16_types ();
> +  if (TARGET_BF16_FP)
> +    aarch64_init_bf16_types ();
>
>      if (TARGET_SIMD)
>        aarch64_init_simd_builtins ();
>
>
>
> But the problem in that case was that it the types could not be re-enabled 
> using 
> a target pragma like:
>
> #pragma GCC push_options
> #pragma GCC target ("+bf16")
>
> Inside the test.
>
> (i.e. the pragma caused the ifdef to be TRUE, but __bf16 was still not being 
> enabled afaict?)
>
> So I'm not sure what to do, presumably we do want some guard around the type 
> so 
> as not to just ICE if the type is used without +bf16?

Other header files work both ways: you get the same definitions regardless
of what the target was when the header file was included.  Then we need
to raise an error if the user tries to do something that the current
target doesn't support.

I suppose for bf16 we could either (a) try to raise an error whenever
BF-related moves are emitted without the required target feature or
(b) handle __bf16 types like __fp16 types.  The justification for
(b) is that there aren't really any new instructions for moves;
__bf16 is mostly a software construct as far as this specific
patch goes.  (It's a different story for the intrinsics patch
of course.)

I don't know which of (a) or (b) is better.  Whichever we go for,
it would be good if clang and GCC were consistent here.

>> It would be good to have more test coverage than this.  E.g.:
>> 
>> - a test that includes arm_bf16.h, with just scalar tests.
>
> Done as test 2, but it is a small test. Is there anything I could add to it?
> (I feel like ideally I'd want to try and force it down every alternative of 
> the 
> RTL pattern)

register asms are one way of doing that, see e.g
gcc.target/aarch64/sve/struct_move_1.c

>> 
>> - a test for _Complex bfloat16_t.
>
> I don't think we currently have a decision on whether this should be 
> supported 
> or not.
> AFAICT we also don't have complex __fp16 support either. I'm getting the same 
> error messages attempting to compile a _Complex __fp16 but it's always likely 
> I'm going at this wrong!
>
> Added test 5 to show you what I was trying to do and to catch the error 
> messages 
> in their current form, but I'm not sure if I've done this right either, tbh!

Testing for an error is a good option if we don't intend to support this.
The main reason for having a test is to make sure that there's no ICE.

So the test in the new patch LGTM, thanks.

>> - a test for moves involving:
>> 
>>      typedef bfloat16_t v16bf __attribute__((vector_size(32)));
>
> Oh that's a good idea, thank you for pointing it out!
>
> See test 6 for reference.
>
> So for vector size 16, 128bits, this looks fine, loading and storing from q 
> registers (using aarch64_simd_movv8bf).
>
> For vector size 32, 256 bits, the compiler chooses to use 4*x-registers 
> instead, 
> resulting in this piece of assembler
>
> stacktest2:
>           sub     sp, sp, #64
>           ldp     x2, x3, [x0]
>           stp     x2, x3, [sp]
>           ldp     x0, x1, [x0, 16]
>           stp     x0, x1, [sp, 16]
>           ldp     x0, x1, [sp]
>           stp     x0, x1, [sp, 32]
>           ldp     x2, x3, [sp, 16]
>           stp     x2, x3, [sp, 48]
>           stp     x0, x1, [x8]
>           ldp     x0, x1, [sp, 48]
>           stp     x0, x1, [x8, 16]
>           add     sp, sp, 64
>           ret
>
> Which looks strange using regular registers in movti mode, but I tested it 
> with 
> float16 and float32 vectors and they the same also give the same result.
>
> However, using an integer vector generates:
>
> stacktest2:
>           ld1     {v0.16b - v1.16b}, [x0]
>           sub     sp, sp, #32
>           st1     {v0.16b - v1.16b}, [sp]
>           ld1     {v0.16b - v1.16b}, [sp]
>           st1     {v0.16b - v1.16b}, [x8]
>           add     sp, sp, 32
>           ret
>
> from the aarch64_movoi pattern. So now I'm unsure whether to leave this as is 
> or 
> to look into why all float modes are not being used through the seemingly 
> more 
> efficient movoi pattern. What do you think?
> (i intend to look into this further)

Haven't tried, but is this affected by -fno-split-wide-types?

But here too the main thing is to make sure that there's no ICE when
using the vectors.  Making it efficient can be (very low priority)
follow-on work.

So it's probably best not to match any specific output here.
Just testing that the moves compile is OK.

>> - a test that involves moving constants, for both scalars and vectors.
>>    You can create zero scalar constants in C++ using bfloat16_t() etc.
>>    For vectors it's possible to do things like:
>> 
>>      typedef short v2bf __attribute__((vector_size(4)));
>>      v2hi foo (void) { return (v2hi) 0x12345678; }
>> 
>>    The same sort of things should work for bfloat16x4_t and bfloat16x8_t.
>
> Leaving this as an open issue for now because I'm not 100% sure what we 
> should/shouldn't be allowing past the tree-level target hooks.
>
> If we do want to block this we would do this in the [2/2] patch.
> I will come back to it and create a scan-assembler test when I'm more clear 
> on 
> what we should and shouldn't allow at the higher level :)

FWIW, I'm not sure we should go out of our way to disallow this.
Preventing bfloat16_t() in C++ would IMO be unnatural.  And the
"(vector) vector-sized-integer" syntax specifically treats the vector
as a bundle of bits without really caring what the element type is.
Even if we did manage to forbid the conversion in that context,
it would still be possible to achieve the same thing using:

   v2hi
   foo (void)
   {
     union { v2hi v; unsigned int i; } u;
     u.i = 0x12345678;
     return u.v;
   }

Thanks for the new patch, looks good apart from the points above and:

> +;; Iterator for all scalar floating point modes suitable for moving, 
> including
> +;; special BF type.(HF, SF, DF, TF and BF)

Nit: should be space rather than "." before "(".

> +(define_mode_iterator GPF_TF_F16_MOV [(HF "") (BF "TARGET_BF16_FP") (SF "")
> +                                   (DF "") (TF "")])
> +
>  ;; Double vector modes.
>  (define_mode_iterator VDF [V2SF V4HF])
>  
> @@ -79,6 +87,9 @@
>  ;; Double vector modes.
>  (define_mode_iterator VD [V8QI V4HI V4HF V2SI V2SF])
>  
> +;; Double vector modes suitable for moving.  Includes BFmode.
> +(define_mode_iterator VDMOV [V8QI V4HI V4HF V4BF V2SI V2SF])
> +
>  ;; All modes stored in registers d0-d31.
>  (define_mode_iterator DREG [V8QI V4HI V4HF V2SI V2SF DF])
>  
> @@ -94,6 +105,9 @@
>  ;; Quad vector modes.
>  (define_mode_iterator VQ [V16QI V8HI V4SI V2DI V8HF V4SF V2DF])
>  
> +;; Quad vector modes suitable for moving.  Includes BFmode.
> +(define_mode_iterator VQMOV [V16QI V8HI V4SI V2DI V8HF V8BF V4SF V2DF])
> +
>  ;; Copy of the above.
>  (define_mode_iterator VQ2 [V16QI V8HI V4SI V2DI V8HF V4SF V2DF])
>  

This looks a bit inconsistent: the scalar iterator requires
TARGET_BF16_FP for bf16 modes, but the vector iterator doesn't.

> @@ -160,6 +174,15 @@
>  (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
>                               V4HF V8HF V2SF V4SF V2DF])
>  
> +;; All Advanced SIMD modes suitable for moving, loading, and storing,
> +;; including special Bfloat vector types.
> +(define_mode_iterator VALL_F16MOV [(V8QI "") (V16QI "") (V4HI "") (V8HI "")
> +                                (V2SI "") (V4SI "") (V2DI "")
> +                                (V4HF "") (V8HF "")
> +                                (V4BF "TARGET_BF16_SIMD")
> +                                (V8BF "TARGET_BF16_SIMD")
> +                                (V2SF "") (V4SF "") (V2DF "")])
> +
>  ;; The VALL_F16 modes except the 128-bit 2-element ones.
>  (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI V4SI
>                               V4HF V8HF V2SF V4SF])

whereas here we do check.  But that comes back to the (a)/(b) choice above.

> diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_1.c 
> b/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_1.c
> new file mode 100644
> index 00000000000..f2bef671deb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_1.c
> @@ -0,0 +1,51 @@
> +/* { dg-do assemble { target { aarch64*-*-* } } } */
> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
> +/* { dg-add-options arm_v8_2a_bf16_neon }  */
> +/* { dg-additional-options "-O3 --save-temps" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +**stacktest1:
> +**   ...
> +**   str     h0, \[sp, [0-9]+\]
> +**   ldr     h0, \[sp, [0-9]+\]
> +**   ...
> +**   ret
> +*/
> +bfloat16_t stacktest1 (bfloat16_t __a)
> +{
> +  volatile bfloat16_t b = __a;
> +  return b;
> +}
> +
> +/*
> +**stacktest2:
> +**   ...
> +**   str     d0, \[sp, [0-9]+\]
> +**   ldr     d0, \[sp, [0-9]+\]
> +**   ...
> +**   ret
> +*/
> +bfloat16x4_t stacktest2 (bfloat16x4_t __a)
> +{
> +  volatile bfloat16x4_t b = __a;
> +  return b;
> +}
> +
> +/*
> +**stacktest3:
> +**   ...
> +**   str     q0, \[sp\]
> +**   ldr     q0, \[sp\]
> +**   ...
> +**   ret
> +*/
> +bfloat16x8_t stacktest3 (bfloat16x8_t __a)
> +{
> +  volatile bfloat16x8_t b = __a;
> +  return b;
> +}

Might be a daft question, but why do we have an offset for the first
two and not for the last one?  Might be worth hard-coding whatever
offset we use.

If we use -fomit-frame-pointer then the whole function body should
be stable: sub, str, ldr, add, ret.

> @@ -0,0 +1,21 @@
> +/* { dg-do assemble { target { aarch64*-*-* } } } */
> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
> +/* { dg-add-options arm_v8_2a_bf16_neon }  */
> +/* { dg-additional-options "-O3 --save-temps" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#include <arm_bf16.h>
> +
> +/*
> +**stacktest1:
> +**   ...
> +**   str     h0, \[sp, [0-9]+\]
> +**   ldr     h0, \[sp, [0-9]+\]
> +**   ...
> +**   ret
> +*/
> +bfloat16_t stacktest1 (bfloat16_t __a)
> +{
> +  volatile bfloat16_t b = __a;
> +  return b;
> +}

Same comment here.

> diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_3.c 
> b/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_3.c
> new file mode 100644
> index 00000000000..9bcb53b32d8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_3.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=armv8.2-a -O2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#pragma GCC push_options
> +#pragma GCC target ("+bf16")
> +
> +#include <arm_bf16.h>
> +
> +/*
> +**stacktest1:
> +**   ...
> +**   str     h0, \[sp, [0-9]+\]
> +**   ldr     h0, \[sp, [0-9]+\]
> +**   ...
> +**   ret
> +*/
> +bfloat16_t stacktest1 (bfloat16_t __a)
> +{
> +  volatile bfloat16_t b = __a;
> +  return b;
> +}
> +
> +#pragma GCC pop_options

Here too.  No real need for the push & pop, but keeping them is fine
if that seems more obvious.

Thanks,
Richard

Re: [GCC][PATCH][Aarch64] Add Bfloat16_t scalar type, vector types and machine modes to Aarch64 back-end [1/2]

Reply via email to