On Fri, Jul 12, 2024 at 12:44 PM Tejas Belagod <tejas.bela...@arm.com> wrote: > > On 7/12/24 11:46 AM, Richard Biener wrote: > > On Fri, Jul 12, 2024 at 6:17 AM Tejas Belagod <tejas.bela...@arm.com> wrote: > >> > >> On 7/10/24 4:37 PM, Richard Biener wrote: > >>> On Wed, Jul 10, 2024 at 12:44 PM Richard Sandiford > >>> <richard.sandif...@arm.com> wrote: > >>>> > >>>> Tejas Belagod <tejas.bela...@arm.com> writes: > >>>>> On 7/10/24 2:38 PM, Richard Biener wrote: > >>>>>> On Wed, Jul 10, 2024 at 10:49 AM Tejas Belagod <tejas.bela...@arm.com> > >>>>>> wrote: > >>>>>>> > >>>>>>> On 7/9/24 4:22 PM, Richard Biener wrote: > >>>>>>>> On Tue, Jul 9, 2024 at 11:45 AM Tejas Belagod > >>>>>>>> <tejas.bela...@arm.com> wrote: > >>>>>>>>> > >>>>>>>>> On 7/8/24 4:45 PM, Richard Biener wrote: > >>>>>>>>>> On Mon, Jul 8, 2024 at 11:27 AM Tejas Belagod > >>>>>>>>>> <tejas.bela...@arm.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>> Hi, > >>>>>>>>>>> > >>>>>>>>>>> Sorry to have dropped the ball on > >>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html, > >>>>>>>>>>> but > >>>>>>>>>>> here I've tried to pick it up again and write up a strawman > >>>>>>>>>>> proposal for > >>>>>>>>>>> elevating __attribute__((vector_mask)) to the FE from GIMPLE. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> Tejas. > >>>>>>>>>>> > >>>>>>>>>>> Motivation > >>>>>>>>>>> ---------- > >>>>>>>>>>> > >>>>>>>>>>> The idea of packed boolean vectors came about when we wanted to > >>>>>>>>>>> support > >>>>>>>>>>> C/C++ operators on SVE ACLE types. The current vector boolean > >>>>>>>>>>> type that > >>>>>>>>>>> ACLE specifies does not adequately disambiguate vector lane sizes > >>>>>>>>>>> which > >>>>>>>>>>> they were derived off of. Consider this simple, albeit > >>>>>>>>>>> unrealistic, example: > >>>>>>>>>>> > >>>>>>>>>>> bool foo (svint32_t a, svint32_t b) > >>>>>>>>>>> { > >>>>>>>>>>> svbool_t p = a > b; > >>>>>>>>>>> > >>>>>>>>>>> // Here p[2] is not the same as a[2] > b[2]. > >>>>>>>>>>> return p[2]; > >>>>>>>>>>> } > >>>>>>>>>>> > >>>>>>>>>>> In the above example, because svbool_t has a fixed > >>>>>>>>>>> 1-lane-per-byte, p[i] > >>>>>>>>>>> does not return the bool value corresponding to a[i] > b[i]. This > >>>>>>>>>>> necessitates a 'typed' vector boolean value that unambiguously > >>>>>>>>>>> represents results of operations > >>>>>>>>>>> of the same type. > >>>>>>>>>>> > >>>>>>>>>>> __attribute__((vector_mask)) > >>>>>>>>>>> ----------------------------- > >>>>>>>>>>> > >>>>>>>>>>> Note: If interested in historical discussions refer to: > >>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html > >>>>>>>>>>> > >>>>>>>>>>> We define this new attribute which when applied to a base data > >>>>>>>>>>> vector > >>>>>>>>>>> produces a new boolean vector type that represents a boolean type > >>>>>>>>>>> that > >>>>>>>>>>> is produced as a result of operations on the corresponding base > >>>>>>>>>>> vector > >>>>>>>>>>> type. The following is the syntax. > >>>>>>>>>>> > >>>>>>>>>>> typedef int v8si __attribute__((vector_size (8 * sizeof > >>>>>>>>>>> (int))); > >>>>>>>>>>> typedef v8si v8sib __attribute__((vector_mask)); > >>>>>>>>>>> > >>>>>>>>>>> Here the 'base' data vector type is v8si or a vector of 8 > >>>>>>>>>>> integers. > >>>>>>>>>>> > >>>>>>>>>>> Rules > >>>>>>>>>>> > >>>>>>>>>>> • The layout/size of the boolean vector type is > >>>>>>>>>>> implementation-defined > >>>>>>>>>>> for its base data vector type. > >>>>>>>>>>> > >>>>>>>>>>> • Two boolean vector types who's base data vector types have same > >>>>>>>>>>> number > >>>>>>>>>>> of elements and lane-width have the same layout and size. > >>>>>>>>>>> > >>>>>>>>>>> • Consequently, two boolean vectors who's base data vector types > >>>>>>>>>>> have > >>>>>>>>>>> different number of elements or different lane-size have > >>>>>>>>>>> different layouts. > >>>>>>>>>>> > >>>>>>>>>>> This aligns with gnu vector extensions that generate integer > >>>>>>>>>>> vectors as > >>>>>>>>>>> a result of comparisons - "The result of the comparison is a > >>>>>>>>>>> vector of > >>>>>>>>>>> the same width and number of elements as the comparison operands > >>>>>>>>>>> with a > >>>>>>>>>>> signed integral element type." according to > >>>>>>>>>>> > >>>>>>>>>>> https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html. > >>>>>>>>>> > >>>>>>>>>> Without having the time to re-review this all in detail I think > >>>>>>>>>> the GNU > >>>>>>>>>> vector extension does not expose the result of the comparison as > >>>>>>>>>> the > >>>>>>>>>> machine would produce it but instead a comparison "decays" to > >>>>>>>>>> a conditional: > >>>>>>>>>> > >>>>>>>>>> typedef int v4si __attribute__((vector_size(16))); > >>>>>>>>>> > >>>>>>>>>> v4si a; > >>>>>>>>>> v4si b; > >>>>>>>>>> > >>>>>>>>>> void foo() > >>>>>>>>>> { > >>>>>>>>>> auto r = a < b; > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> produces, with C23: > >>>>>>>>>> > >>>>>>>>>> vector(4) int r = VEC_COND_EXPR < a < b , { -1, -1, -1, -1 > >>>>>>>>>> } , { 0, > >>>>>>>>>> 0, 0, 0 } > ; > >>>>>>>>>> > >>>>>>>>>> In fact on x86_64 with AVX and AVX512 you have two different > >>>>>>>>>> "machine > >>>>>>>>>> produced" mask types and the above could either produce a AVX mask > >>>>>>>>>> with > >>>>>>>>>> 32bit elements or a AVX512 mask with 1bit elements. > >>>>>>>>>> > >>>>>>>>>> Not exposing "native" mask types requires the compiler optimizing > >>>>>>>>>> subsequent > >>>>>>>>>> uses and makes generic vectors difficult to combine with for > >>>>>>>>>> example AVX512 > >>>>>>>>>> intrinsics (where masks are just 'int'). Across an ABI boundary > >>>>>>>>>> it's also > >>>>>>>>>> even more difficult to optimize mask transitions. > >>>>>>>>>> > >>>>>>>>>> But it at least allows portable code and it does not suffer from > >>>>>>>>>> users trying to > >>>>>>>>>> expose machine representations of masks as input to generic vector > >>>>>>>>>> code > >>>>>>>>>> with all the problems of constant folding not only requiring > >>>>>>>>>> self-consistent > >>>>>>>>>> code within the compiler but compatibility with user produced > >>>>>>>>>> constant masks. > >>>>>>>>>> > >>>>>>>>>> That said, I somewhat question the need to expose the target mask > >>>>>>>>>> layout > >>>>>>>>>> to users for GCCs generic vector extension. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> Thanks for your feedback. > >>>>>>>>> > >>>>>>>>> IIUC, I can imagine how having a GNU vector extension exposing the > >>>>>>>>> target vector mask layout can pose a challenge - maybe making it a > >>>>>>>>> generic GNU vector extension was too ambitious. I wonder if there's > >>>>>>>>> value in pursuing these alternate paths? > >>>>>>>>> > >>>>>>>>> 1. Can implementing this extension in a 'generic' way i.e. possibly > >>>>>>>>> not > >>>>>>>>> implement it with a target mask, but just a generic int vector, > >>>>>>>>> still > >>>>>>>>> maintain the consistency of GNU predicate vectors within the > >>>>>>>>> compiler? I > >>>>>>>>> know it may not seem very different from how boolean vectors are > >>>>>>>>> currently implemented (as in your above example), but, having the > >>>>>>>>> __attribute__((vector_mask)) as a 'property' of the object makes it > >>>>>>>>> useful to optimize its uses to target predicates in subsequent > >>>>>>>>> stages of > >>>>>>>>> the compiler. > >>>>>>>>> > >>>>>>>>> 2. Restricting __attribute__((vector_mask)) to apply only to target > >>>>>>>>> intrinsic types? Eg. > >>>>>>>>> > >>>>>>>>> On SVE something like: > >>>>>>>>> typedef svint16_t svpred16_t __attribute__((vector_mask)); // OK. > >>>>>>>>> > >>>>>>>>> On AVX, something like: > >>>>>>>>> typedef __m256i __mask32 __attribute__((vector_mask)); // OK - > >>>>>>>>> though > >>>>>>>>> this would require more fine-grained defn of lane-size to mask-bits > >>>>>>>>> mapping. > >>>>>>>> > >>>>>>>> I think the target should be able to register builtin types already > >>>>>>>> which > >>>>>>>> intrinsics could use. There is already the vector_mask attribute > >>>>>>>> but only > >>>>>>>> for GIMPLE and it has the same limitation of querying the target for > >>>>>>>> the > >>>>>>>> actual mode being used - for AVX vs AVX512 one might be able to > >>>>>>>> combine this with a mode attribute. Not sure if on arm you can parse > >>>>>>>> __attribute__((mode("Vx4BI4"))) or how the modes are called. > >>>>>>>> > >>>>>>>> But when you are talking about intrinsics I'd really suggest to > >>>>>>>> leave the > >>>>>>>> type creation to the target rather than trying to do a typedef in a > >>>>>>>> header? > >>>>>>>> > >>>>>>> > >>>>>>> Yeah, thinking about this a bit more, makes sense to keep intrinsic > >>>>>>> type > >>>>>>> creation in the target realm. > >>>>>>> > >>>>>>> Just to clarify if I understand your point about exposing masks' > >>>>>>> machine > >>>>>>> representations, would representing vector_mask types using opaque > >>>>>>> types/modes have the same challenges with compatibility with generic > >>>>>>> vector constants as it essentially would be a parallel type system, > >>>>>>> and > >>>>>>> would be unaffected by constant-folding etc due to their opacity? I > >>>>>>> ask > >>>>>>> because opacity might give the representation the flexibility of > >>>>>>> 'decaying' to a type based on the context it is used in. > >>>>>> > >>>>>> I also thought about using an opaque type but I wonder if it really > >>>>>> suits > >>>>>> here? > >>>>> > >>>>> Sorry, yes using opaque type was your idea from last year's thread - I > >>>>> merely reiterated it here. :-) > >>>>> > >>>>> Or would the target then need to decay a mask[i] into something > >>>>>> that's later recognizable? > >>>>>> > >>>>> > >>>>> I think that would depend on the usage, wouldn't it - it could lower > >>>>> down to target insn(s) based on how whether, for eg, its used as a test > >>>>> or read as a scalar value? > >>>>> > >>>>> > >>>>>> So I guess the answer is you'd have to try. > >>>>> > >>>>> Thanks for your feedback so far - much appreciated. If it helps, I will > >>>>> try to write up a prototype to test the idea - might help clear the mist > >>>>> further. > >>>> > >>>> Just to note that one of the original motivations (that applies more > >>>> to option 3 from last year's proposal) was to add support for general > >>>> packed vector boolean types to the GNU vector extension, as a feature > >>>> independent of the target's "native" format(s). Clang already supports > >>>> this via ext_vector_type and it seemed like there might be value in > >>>> providing something similar for the GNU extensions. > >>> > >>> But that's more for data, aka vector bool, not for what's produced by > >>> targets from vector comparisons? So yes, I suppose that's reasonable > >>> but representation would then be fully defined by the extension > >>> rather than by however the target computes the actual comparison > >>> result vector. > >>> > >> > >> Sorry for the slow response. > >> > >> Thanks RichardS for your timely comment. Sorry, I might have gotten > >> ambitious with the original vector bool proposal and went down the route > >> of supporting 'native' formats with vector_mask, but scaling my > >> ambitions back to a boolean vector of a certain representation that is > >> independent of the target's native format and defined by the extension > >> itself is a more realistic proposition. > >> > >> To reiterate option 3 from last year's proposal, currently we don't support > >> > >> typedef bool vbool __attribute__((__vector_size__(64))); > >> > >> But if we did, could we support a more layout-friendly form i.e. > >> > >> typedef bool vbool __attribute__((vector_size (s, n[, w]))); > >> > >> where 's' is size in bytes, 'n' is the number of lanes and an optional > >> 3rd parameter 'w' is the number of bits of the PBV that represents a > >> lane of the target vector? 'w' would allow a target to force a certain > >> layout of the PBV. > > > > isn't one of s, n or w redundant? That is, w == (s * 8) / n? Or > > would vector_size (8, 32, 1) put in 1 bit of "padding" per lane? > > (but where?) > > > > In my mind, I imagined a sparse layout that padded the ((s * 8)/n - w) > bits in each lane's MSB. But I guess even if we chose a packed layout > and padded the extra bits at the MSB of the full vector, the complexity > of implementing operations on them wouldn't change. I admit I haven't > thought through all cases here. > > > That said, how about > > > > typedef unsigned _BitInt(1) vbool __attribute__((vector_size (8))); > > > > instead? Slight complication is that _BitInt isn't supported in C++, > > but I suppose that could be fixed at least as extension? > > > > As we currently reject > > > > typedef _Bool vbool __attribute__((vector_size (8))); > > > > we can also chose to accept that as the 1-bit case at least. > > I remember there were some thoughts about using _BitInt last year but > didn't make it into the original proposal as _BitInt support was being > developed at the time this proposal was being discussed. From what I > understand about _BitInt, it looks plausible, but again I haven't > thought through it. Would the number of lanes be implicit here? I.e. for > > typedef unsigned _BitInt(N) vbool __attribute__((vector_size (S))); > > would L = (S * 8) / N ? As in the original scheme, I guess the padding > position of extra bits is still something that needs to be decided?
Padding is only an issue for very small vectors - the obvious choice is to disallow vector types that would require any padding. I can hardly see where those are faster than using a vector of up to 4 char elements. Problematic are 1-bit elements with 4, 2 or one element vectors, 2-bit elements with 2 or one element vectors and 4-bit elements with 1 element vectors. > Another thing is I don't know how much work there is to support _BitInt > for C++ or what limitations there are in supporting it in C++. Well, if you have a 2-bit element vector you have to somehow specify what decltype (v[0]) yields. Would it behave like bitfields in structs? I think for efficient operation you'd like to avoid promotion to int which means _BitInt feels like a natural choice. I have no idea what the stance of supporting _BitInt in C++ are, but most certainly diverging support (or even semantics) of the vector extension in C vs. C++ is undesirable. > > > >> I don't know if overloading vector_size is a good idea though... > > > > Is there precedent in other compilers for supporting bit-precision > > vector components in extensions to GCCs vector extension? > > > > Not that I know of. So far Clang's ext_vector_type is the closest one > I've come across. > > Thanks, > Tejas. > > > Richard. > > > >> Thanks, > >> Tejas. > >> > >> > >>> Richard. > >>> > >>>> Thanks, > >>>> Richard > >>>> > >>>> > >> >