On August 3, 2017 7:05:05 PM GMT+02:00, Richard Sandiford <richard.sandif...@arm.com> wrote:
>Torvald Riegel <trie...@redhat.com> writes:
>> On Wed, 2017-08-02 at 17:59 +0100, Richard Sandiford wrote:
>>> Torvald Riegel <trie...@redhat.com> writes:
>>> > On Wed, 2017-08-02 at 14:09 +0100, Richard Sandiford wrote:
>>> >> (1) Does the approach seem reasonable?
>>> >>
>>> >> (2) Would it be acceptable in principle to add this extension to the GCC C frontend (only enabled where necessary)?
>>> >>
>>> >> (3) Should we submit this to the standards committee?
>>> >
>>> > I haven't had time to look at the proposal in detail. I think it would be good to have the standards committees review this. I doubt you could find consensus in the C++ committee for type system changes unless you have a really good reason. Have you considered how you could use the ARM extensions from http://wg21.link/p0214r4 ?
>>>
>>> Yeah, we've been following that proposal, but I don't think it helps as-is with SVE. datapar<T> is "an array of target-specific size, with elements of type T, ..." and for SVE the natural target-specific size would be a runtime value. The core language would still need to provide a way of creating that array.
>>
>> I think the actual data will have a size -- your proposal seems to try to express a control structure (i.e., SIMD / loop-like) by changing the type system. This seems odd to me.
>>
>> Why can't you let the underlying data have a size (i.e., be an array), and change your operations to work on arrays or slices of arrays? That won't help with automatic-storage-duration variables that one would need as temporaries, but perhaps it would be better to let programmers declare these variables as large vectors and have the compiler figure out what size they really need to be, if they are only accessed through the SVE operations as temporary storage?
>
>The types only really exist for objects of automatic storage duration and for passing to and returning from functions. Like you say, the original input and final result will be normal arrays.
>
>For example, the vector function underlying:
>
>  #pragma omp declare simd
>  double sin(double);
>
>would be:
>
>  svfloat64_t mangled_sin(svfloat64_t, svbool_t);
>
>(The svbool_t is because SVE functions should be predicated by default, to avoid the need for a scalar tail.)
>
>These svfloat64_t and svbool_t types have no fixed size at compile time: they represent one SVE register's worth of data, however big that register happens to be. Making datapar<T> be an array of a specific size would make it unsuitable here.
>
>To put it another way: the calling conventions do have the concept of a register-sized vector that can be passed and returned efficiently. These ACLE types are the C manifestations of those register-sized ABI types. If instead we said that SVE vectors should be implicitly extracted from a larger array, the ABI type would not have a direct representation in C. I can't think of another case where that's true.
>
>Leaving aside the question of vector library functions, if functions used arrays for temporary results, and the ACLE intrinsics only operated on slices of those arrays, it wouldn't always be obvious how big the arrays should be.
>For example, here's a naive ACLE implementation of a step-1 daxpy (quoting only to show the use of the types, since a walkthrough of the behaviour might be off-topic):
>
>  void daxpy_1_1(int64_t n, double da, double *dx, double *dy)
>  {
>    int64_t i = 0;
>    svbool_t pg = svwhilelt_b64(i, n);
>    do
>      {
>        svfloat64_t dx_vec = svld1(pg, &dx[i]);
>        svfloat64_t dy_vec = svld1(pg, &dy[i]);
>        svst1(pg, &dy[i], svmla_x(pg, dy_vec, dx_vec, da));
>        i += svcntd();
>        pg = svwhilelt_b64(i, n);
>      }
>    while (svptest_any(svptrue_b64(), pg));
>  }
Something like this instead:

  int64_t i;
  double dx_tmp = SVE_load (&i, &dx[0], n);
  double dy_tmp = SVE_load (&i, &dy[0], n);
  SVE_store (&dy[0], dx_tmp * da + dy_tmp);

The &i args to the loads would be optional, in case you need the active lane somewhere. So the scalar-temporary, "middle-end array" way would be a data-parallel programming paradigm. For ABI purposes I suggest using attributes on the function to change scalars to SVE vectors. (A slightly fuller sketch of the builtin signatures and the attribute idea is at the bottom, below the quoted text.)

Using scalars has the advantage that regular optimizations can be applied, inlining works, and you can easily lower this to scalar code or to another architecture's vector code. With vectors this is also way easier than with strided multi-dimensional arrays ;)

(Sorry for typos, writing this on my mobile phone...)

Richard.

>This isn't a good motivating example for why the ACLE is needed, since the compiler ought to produce similar code from a simple scalar loop. But if you were writing a less naive implementation for SVE, it would use the ACLE in a similar way.
>
>The point is that this implementation supports any vector length. There's no hard limit on the size of the temporaries.
>
>A perhaps more useful example is a naive implementation of a loop that converts non-printable ASCII characters to '.' (obviously not a common time-critical operation, but it has the advantage of being short and using a few SVE-specific features):
>
>  void f(uint8_t *a)
>  {
>    svbool_t trueb = svptrue_b8();
>    svuint8_t dots = svdup_u8('.');
>    svbool_t terminators;
>    do
>      {
>        svwrffr(trueb);
>        svuint8_t data = svldff1(trueb, a);
>        svbool_t ld_mask = svrdffr();
>        svbool_t nonascii = svcmplt(ld_mask, data, ' '-1);
>        terminators = svcmpeq(ld_mask, data, 0);
>        svbool_t st_mask = svbrkb_z(nonascii, terminators);
>        svst1(st_mask, a, dots);
>        a += svcntp_b8(trueb, ld_mask);
>      }
>    while (!svptest_any(trueb, terminators));
>  }
>
>Again, a walkthrough of the code might be off-topic, but the point is that the trip count of this loop is data-dependent and the loop doesn't necessarily operate on the same number of elements in each iteration. It can't simply be written as a vector extension of a scalar operation.
>
>Also, it's possible to rewrite this in a way that should be more efficient in common cases. The point of the ACLE is that the programmer has direct control over that kind of decision, rather than leaving the control flow to the compiler.
>
>>> Similarly to other vector architectures (including AdvSIMD), the SVE intrinsics and their types are more geared towards people who want to optimise specifically for SVE without having to resort to assembly. That's an important use case for us, and I think there's always going to be a need for it alongside generic SIMD and parallel-programming models (which of course are a good thing to have too).
>>>
>>> Being able to use SVE features from C is also important. Not all projects are prepared to convert to C++.
>>
>> I'd doubt that the sizeless types would find consensus in the C++ committee. The C committee may perhaps be more open to that, given that C is more restricted and thus has to use language extensions more often.
>>
>> If they don't find uptake in ISO C/C++, this will always be a vendor-specific thing. You seem to say that this may be okay for you, but are there enough non-library-implementer developers out there that would use it to justify extending the type system?
>
>We'd certainly like to get ACLE support into GCC and clang if possible.
>It just seemed like the two ways of doing that were to get the type system changes accepted by the standards committee or to get them accepted as an extension by both compilers individually (similarly to how both compilers support many GNU extensions).
>
>[From another message]
>
>> BTW, have you also looked at P0546 and P0122?
>
>Thanks for the pointer. I hadn't seen those before, but I'm not sure they would help. P0122 says:
>
>  The span type is an abstraction that provides a view over a contiguous sequence of objects, the storage of which is owned by some other object.
>
>whereas we want the ACLE types to be self-contained types that own the underlying storage.
>
>Thanks,
>Richard
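
P.S. To make the sketch above a bit more concrete, here is roughly what I have in mind. Everything below is hypothetical: SVE_load, SVE_store and the sve_vector_abi attribute are placeholder names invented for this sketch, not existing builtins or attributes, and the declarations are only there to make it self-contained.

  #include <stdint.h>

  /* Hypothetical whole-array primitives: process up to N elements
     starting at BASE; *ACTIVE optionally exposes the active lane.  */
  double SVE_load (int64_t *active, const double *base, int64_t n);
  void SVE_store (double *base, double value);

  /* daxpy written with scalar temporaries; the compiler would decide
     what vector width (if any) the temporaries really get.  */
  void daxpy_sketch (int64_t n, double da, double *dx, double *dy)
  {
    int64_t i;
    double dx_tmp = SVE_load (&i, &dx[0], n);
    double dy_tmp = SVE_load (&i, &dy[0], n);
    SVE_store (&dy[0], dx_tmp * da + dy_tmp);
  }

  /* For the ABI side, a function attribute (placeholder name) would
     mark the scalar parameters and return value as really being SVE
     vectors, so the vector version of sin could be declared as:  */
  __attribute__ ((sve_vector_abi))
  double mangled_sin (double x, _Bool pg);

Apart from the attribute this is plain scalar C, which is why inlining and the usual scalar optimizations keep working and the same source can be lowered to scalar code or to another target's vectors.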