On Fri, Jul 12, 2024 at 12:44 PM Tejas Belagod <tejas.bela...@arm.com> wrote:
>
> On 7/12/24 11:46 AM, Richard Biener wrote:
> > On Fri, Jul 12, 2024 at 6:17 AM Tejas Belagod <tejas.bela...@arm.com> wrote:
> >>
> >> On 7/10/24 4:37 PM, Richard Biener wrote:
> >>> On Wed, Jul 10, 2024 at 12:44 PM Richard Sandiford
> >>> <richard.sandif...@arm.com> wrote:
> >>>>
> >>>> Tejas Belagod <tejas.bela...@arm.com> writes:
> >>>>> On 7/10/24 2:38 PM, Richard Biener wrote:
> >>>>>> On Wed, Jul 10, 2024 at 10:49 AM Tejas Belagod <tejas.bela...@arm.com> 
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 7/9/24 4:22 PM, Richard Biener wrote:
> >>>>>>>> On Tue, Jul 9, 2024 at 11:45 AM Tejas Belagod 
> >>>>>>>> <tejas.bela...@arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 7/8/24 4:45 PM, Richard Biener wrote:
> >>>>>>>>>> On Mon, Jul 8, 2024 at 11:27 AM Tejas Belagod 
> >>>>>>>>>> <tejas.bela...@arm.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry to have dropped the ball on
> >>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html, 
> >>>>>>>>>>> but
> >>>>>>>>>>> here I've tried to pick it up again and write up a strawman 
> >>>>>>>>>>> proposal for
> >>>>>>>>>>> elevating __attribute__((vector_mask)) to the FE from GIMPLE.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Tejas.
> >>>>>>>>>>>
> >>>>>>>>>>> Motivation
> >>>>>>>>>>> ----------
> >>>>>>>>>>>
> >>>>>>>>>>> The idea of packed boolean vectors came about when we wanted to 
> >>>>>>>>>>> support
> >>>>>>>>>>> C/C++ operators on SVE ACLE types. The current vector boolean 
> >>>>>>>>>>> type that
> >>>>>>>>>>> ACLE specifies does not adequately disambiguate vector lane sizes 
> >>>>>>>>>>> which
> >>>>>>>>>>> they were derived off of. Consider this simple, albeit 
> >>>>>>>>>>> unrealistic, example:
> >>>>>>>>>>>
> >>>>>>>>>>>         bool foo (svint32_t a, svint32_t b)
> >>>>>>>>>>>         {
> >>>>>>>>>>>           svbool_t p = a > b;
> >>>>>>>>>>>
> >>>>>>>>>>>           // Here p[2] is not the same as a[2] > b[2].
> >>>>>>>>>>>           return p[2];
> >>>>>>>>>>>         }
> >>>>>>>>>>>
> >>>>>>>>>>> In the above example, because svbool_t has a fixed 
> >>>>>>>>>>> 1-lane-per-byte, p[i]
> >>>>>>>>>>> does not return the bool value corresponding to a[i] > b[i]. This
> >>>>>>>>>>> necessitates a 'typed' vector boolean value that unambiguously
> >>>>>>>>>>> represents results of operations
> >>>>>>>>>>> of the same type.
> >>>>>>>>>>>
> >>>>>>>>>>> __attribute__((vector_mask))
> >>>>>>>>>>> -----------------------------
> >>>>>>>>>>>
> >>>>>>>>>>> Note: If interested in historical discussions refer to:
> >>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html
> >>>>>>>>>>>
> >>>>>>>>>>> We define this new attribute which when applied to a base data 
> >>>>>>>>>>> vector
> >>>>>>>>>>> produces a new boolean vector type that represents a boolean type 
> >>>>>>>>>>> that
> >>>>>>>>>>> is produced as a result of operations on the corresponding base 
> >>>>>>>>>>> vector
> >>>>>>>>>>> type. The following is the syntax.
> >>>>>>>>>>>
> >>>>>>>>>>>         typedef int v8si __attribute__((vector_size (8 * sizeof 
> >>>>>>>>>>> (int)));
> >>>>>>>>>>>         typedef v8si v8sib __attribute__((vector_mask));
> >>>>>>>>>>>
> >>>>>>>>>>> Here the 'base' data vector type is v8si or a vector of 8 
> >>>>>>>>>>> integers.
> >>>>>>>>>>>
> >>>>>>>>>>> Rules
> >>>>>>>>>>>
> >>>>>>>>>>> • The layout/size of the boolean vector type is 
> >>>>>>>>>>> implementation-defined
> >>>>>>>>>>> for its base data vector type.
> >>>>>>>>>>>
> >>>>>>>>>>> • Two boolean vector types who's base data vector types have same 
> >>>>>>>>>>> number
> >>>>>>>>>>> of elements and lane-width have the same layout and size.
> >>>>>>>>>>>
> >>>>>>>>>>> • Consequently, two boolean vectors who's base data vector types 
> >>>>>>>>>>> have
> >>>>>>>>>>> different number of elements or different lane-size have 
> >>>>>>>>>>> different layouts.
> >>>>>>>>>>>
> >>>>>>>>>>> This aligns with gnu vector extensions that generate integer 
> >>>>>>>>>>> vectors as
> >>>>>>>>>>> a result of comparisons - "The result of the comparison is a 
> >>>>>>>>>>> vector of
> >>>>>>>>>>> the same width and number of elements as the comparison operands 
> >>>>>>>>>>> with a
> >>>>>>>>>>> signed integral element type." according to
> >>>>>>>>>>>          
> >>>>>>>>>>> https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html.
> >>>>>>>>>>
> >>>>>>>>>> Without having the time to re-review this all in detail I think 
> >>>>>>>>>> the GNU
> >>>>>>>>>> vector extension does not expose the result of the comparison as 
> >>>>>>>>>> the
> >>>>>>>>>> machine would produce it but instead a comparison "decays" to
> >>>>>>>>>> a conditional:
> >>>>>>>>>>
> >>>>>>>>>> typedef int v4si __attribute__((vector_size(16)));
> >>>>>>>>>>
> >>>>>>>>>> v4si a;
> >>>>>>>>>> v4si b;
> >>>>>>>>>>
> >>>>>>>>>> void foo()
> >>>>>>>>>> {
> >>>>>>>>>>        auto r = a < b;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> produces, with C23:
> >>>>>>>>>>
> >>>>>>>>>>        vector(4) int r =  VEC_COND_EXPR < a < b , { -1, -1, -1, -1 
> >>>>>>>>>> } , { 0,
> >>>>>>>>>> 0, 0, 0 } > ;
> >>>>>>>>>>
> >>>>>>>>>> In fact on x86_64 with AVX and AVX512 you have two different 
> >>>>>>>>>> "machine
> >>>>>>>>>> produced" mask types and the above could either produce a AVX mask 
> >>>>>>>>>> with
> >>>>>>>>>> 32bit elements or a AVX512 mask with 1bit elements.
> >>>>>>>>>>
> >>>>>>>>>> Not exposing "native" mask types requires the compiler optimizing 
> >>>>>>>>>> subsequent
> >>>>>>>>>> uses and makes generic vectors difficult to combine with for 
> >>>>>>>>>> example AVX512
> >>>>>>>>>> intrinsics (where masks are just 'int').  Across an ABI boundary 
> >>>>>>>>>> it's also
> >>>>>>>>>> even more difficult to optimize mask transitions.
> >>>>>>>>>>
> >>>>>>>>>> But it at least allows portable code and it does not suffer from 
> >>>>>>>>>> users trying to
> >>>>>>>>>> expose machine representations of masks as input to generic vector 
> >>>>>>>>>> code
> >>>>>>>>>> with all the problems of constant folding not only requiring 
> >>>>>>>>>> self-consistent
> >>>>>>>>>> code within the compiler but compatibility with user produced 
> >>>>>>>>>> constant masks.
> >>>>>>>>>>
> >>>>>>>>>> That said, I somewhat question the need to expose the target mask 
> >>>>>>>>>> layout
> >>>>>>>>>> to users for GCCs generic vector extension.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks for your feedback.
> >>>>>>>>>
> >>>>>>>>> IIUC, I can imagine how having a GNU vector extension exposing the
> >>>>>>>>> target vector mask layout can pose a challenge - maybe making it a
> >>>>>>>>> generic GNU vector extension was too ambitious. I wonder if there's
> >>>>>>>>> value in pursuing these alternate paths?
> >>>>>>>>>
> >>>>>>>>> 1. Can implementing this extension in a 'generic' way i.e. possibly 
> >>>>>>>>> not
> >>>>>>>>> implement it with a target mask, but just a generic int vector, 
> >>>>>>>>> still
> >>>>>>>>> maintain the consistency of GNU predicate vectors within the 
> >>>>>>>>> compiler? I
> >>>>>>>>> know it may not seem very different from how boolean vectors are
> >>>>>>>>> currently implemented (as in your above example), but, having the
> >>>>>>>>> __attribute__((vector_mask)) as a 'property' of the object makes it
> >>>>>>>>> useful to optimize its uses to target predicates in subsequent 
> >>>>>>>>> stages of
> >>>>>>>>> the compiler.
> >>>>>>>>>
> >>>>>>>>> 2. Restricting __attribute__((vector_mask)) to apply only to target
> >>>>>>>>> intrinsic types? Eg.
> >>>>>>>>>
> >>>>>>>>> On SVE something like:
> >>>>>>>>> typedef svint16_t svpred16_t __attribute__((vector_mask)); // OK.
> >>>>>>>>>
> >>>>>>>>> On AVX, something like:
> >>>>>>>>> typedef __m256i __mask32 __attribute__((vector_mask)); // OK - 
> >>>>>>>>> though
> >>>>>>>>> this would require more fine-grained defn of lane-size to mask-bits 
> >>>>>>>>> mapping.
> >>>>>>>>
> >>>>>>>> I think the target should be able to register builtin types already 
> >>>>>>>> which
> >>>>>>>> intrinsics could use.  There is already the vector_mask attribute 
> >>>>>>>> but only
> >>>>>>>> for GIMPLE and it has the same limitation of querying the target for 
> >>>>>>>> the
> >>>>>>>> actual mode being used - for AVX vs AVX512 one might be able to
> >>>>>>>> combine this with a mode attribute.  Not sure if on arm you can parse
> >>>>>>>> __attribute__((mode("Vx4BI4"))) or how the modes are called.
> >>>>>>>>
> >>>>>>>> But when you are talking about intrinsics I'd really suggest to 
> >>>>>>>> leave the
> >>>>>>>> type creation to the target rather than trying to do a typedef in a 
> >>>>>>>> header?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yeah, thinking about this a bit more, makes sense to keep intrinsic 
> >>>>>>> type
> >>>>>>> creation in the target realm.
> >>>>>>>
> >>>>>>> Just to clarify if I understand your point about exposing masks' 
> >>>>>>> machine
> >>>>>>> representations, would representing vector_mask types using opaque
> >>>>>>> types/modes have the same challenges with compatibility with generic
> >>>>>>> vector constants as it essentially would be a parallel type system, 
> >>>>>>> and
> >>>>>>> would be unaffected by constant-folding etc due to their opacity? I 
> >>>>>>> ask
> >>>>>>> because opacity might give the representation the flexibility of
> >>>>>>> 'decaying' to a type based on the context it is used in.
> >>>>>>
> >>>>>> I also thought about using an opaque type but I wonder if it really 
> >>>>>> suits
> >>>>>> here?
> >>>>>
> >>>>> Sorry, yes using opaque type was your idea from last year's thread - I
> >>>>> merely reiterated it here. :-)
> >>>>>
> >>>>> Or would the target then need to decay a mask[i] into something
> >>>>>> that's later recognizable?
> >>>>>>
> >>>>>
> >>>>> I think that would depend on the usage, wouldn't it - it could lower
> >>>>> down to target insn(s) based on how whether, for eg, its used as a test
> >>>>> or read as a scalar value?
> >>>>>
> >>>>>
> >>>>>> So I guess the answer is you'd have to try.
> >>>>>
> >>>>> Thanks for your feedback so far - much appreciated. If it helps, I will
> >>>>> try to write up a prototype to test the idea - might help clear the mist
> >>>>> further.
> >>>>
> >>>> Just to note that one of the original motivations (that applies more
> >>>> to option 3 from last year's proposal) was to add support for general
> >>>> packed vector boolean types to the GNU vector extension, as a feature
> >>>> independent of the target's "native" format(s).  Clang already supports
> >>>> this via ext_vector_type and it seemed like there might be value in
> >>>> providing something similar for the GNU extensions.
> >>>
> >>> But that's more for data, aka vector bool, not for what's produced by
> >>> targets from vector comparisons?  So yes, I suppose that's reasonable
> >>> but representation would then be fully defined by the extension
> >>> rather than by however the target computes the actual comparison
> >>> result vector.
> >>>
> >>
> >> Sorry for the slow response.
> >>
> >> Thanks RichardS for your timely comment. Sorry, I might have gotten
> >> ambitious with the original vector bool proposal and went down the route
> >> of supporting 'native' formats with vector_mask, but scaling my
> >> ambitions back to a boolean vector of a certain representation that is
> >> independent of the target's native format and defined by the extension
> >> itself is a more realistic proposition.
> >>
> >> To reiterate option 3 from last year's proposal, currently we don't support
> >>
> >>    typedef bool vbool __attribute__((__vector_size__(64)));
> >>
> >> But if we did, could we support a more layout-friendly form i.e.
> >>
> >>     typedef bool vbool __attribute__((vector_size (s, n[, w])));
> >>
> >> where 's' is size in bytes, 'n' is the number of lanes and an optional
> >> 3rd parameter 'w' is the number of bits of the PBV that represents a
> >> lane of the target vector? 'w' would allow a target to force a certain
> >> layout of the PBV.
> >
> > isn't one of s, n or w redundant?  That is, w == (s * 8) / n?  Or
> > would vector_size (8, 32, 1) put in 1 bit of "padding" per lane?
> > (but where?)
> >
>
> In my mind, I imagined a sparse layout that padded the ((s * 8)/n - w)
> bits in each lane's MSB. But I guess even if we chose a packed layout
> and padded the extra bits at the MSB of the full vector, the complexity
> of implementing operations on them wouldn't change. I admit I haven't
> thought through all cases here.
>
> > That said, how about
> >
> > typedef unsigned _BitInt(1) vbool __attribute__((vector_size (8)));
> >
> > instead?  Slight complication is that _BitInt isn't supported in C++,
> > but I suppose that could be fixed at least as extension?
> >
> > As we currently reject
> >
> > typedef _Bool vbool __attribute__((vector_size (8)));
> >
> > we can also chose to accept that as the 1-bit case at least.
>
> I remember there were some thoughts about using _BitInt last year but
> didn't make it into the original proposal as _BitInt support was being
> developed at the time this proposal was being discussed. From what I
> understand about _BitInt, it looks plausible, but again I haven't
> thought through it. Would the number of lanes be implicit here? I.e. for
>
> typedef unsigned _BitInt(N) vbool __attribute__((vector_size (S)));
>
> would L = (S * 8) / N ? As in the original scheme, I guess the padding
> position of extra bits is still something that needs to be decided?

Padding is only an issue for very small vectors - the obvious choice is
to disallow vector types that would require any padding.  I can hardly
see where those are faster than using a vector of up to 4 char elements.
Problematic are 1-bit elements with 4, 2 or one element vectors, 2-bit elements
with 2 or one element vectors and 4-bit elements with 1 element vectors.

> Another thing is I don't know how much work there is to support _BitInt
> for C++ or what limitations there are in supporting it in C++.

Well, if you have a 2-bit element vector you have to somehow
specify what decltype (v[0]) yields.  Would it behave like bitfields in
structs?  I think for efficient operation you'd like to avoid promotion
to int which means _BitInt feels like a natural choice.

I have no idea what the stance of supporting _BitInt in C++ are,
but most certainly diverging support (or even semantics) of the
vector extension in C vs. C++ is undesirable.

> >
> >> I don't know if overloading vector_size is a good idea though...
> >
> > Is there precedent in other compilers for supporting bit-precision
> > vector components in extensions to GCCs vector extension?
> >
>
> Not that I know of. So far Clang's ext_vector_type is the closest one
> I've come across.
>
> Thanks,
> Tejas.
>
> > Richard.
> >
> >> Thanks,
> >> Tejas.
> >>
> >>
> >>> Richard.
> >>>
> >>>> Thanks,
> >>>> Richard
> >>>>
> >>>>
> >>
>

Reply via email to