On Mon, Jan 21, 2019 at 2:20 PM Anton Youdkevitch <anton.youdkevi...@bell-sw.com> wrote:
>
> Here is the prototype for doing vectorized reduction using the SLP
> approach. I would appreciate feedback on whether this is a feasible
> approach and whether the overall direction is right.
>
> The idea is to vectorize a reduction like this
>
>   S = A[0]+A[1]+...+A[N];
>
> into
>
>   Sv = Av[0]+Av[1]+...+Av[N/VL];
>
> So that, for instance, the following code:
>
> typedef double T;
> T sum;
>
> void foo (T* __restrict__ a)
> {
>   sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7];
> }
>
> instead of:
>
> foo:
> .LFB23:
>         .cfi_startproc
>         movsd   (%rdi), %xmm0
>         movsd   16(%rdi), %xmm1
>         addsd   8(%rdi), %xmm0
>         addsd   24(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   32(%rdi), %xmm1
>         addsd   40(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   48(%rdi), %xmm1
>         addsd   56(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
> is compiled into:
>
> foo:
> .LFB11:
>         .cfi_startproc
>         movupd  32(%rdi), %xmm0
>         movupd  48(%rdi), %xmm3
>         movupd  (%rdi), %xmm1
>         movupd  16(%rdi), %xmm2
>         addpd   %xmm3, %xmm0
>         addpd   %xmm2, %xmm1
>         addpd   %xmm1, %xmm0
>         haddpd  %xmm0, %xmm0
>         movlpd  %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
> As this is a very crude prototype, there are some things to consider.
>
> 1. The current SLP framework assumes the presence of grouped stores,
> so I cannot use it directly: a reduction does not require grouped
> stores (or even stores at all). I partially reuse the existing
> functionality, but sometimes I have to create a stripped-down version
> of it for my own needs.
>
> 2. The current version handles only PLUS reductions, as they are
> encountered most often and are therefore the most practical.
>
> 3. While an SLP transformation normally operates inside a single
> basic block, this requirement greatly restricts its practical
> application: in sufficiently complex code there will be vectorizable
> subexpressions defined in basic block(s) other than the one where the
> reduction result resides. However, for the sake of simplicity, only
> single uses in the same block are considered for now.
>
> 4. For the same reason, the current version does not deal with
> partial reductions, which would require merging of partial sums and
> careful removal of the scalars that participate in the vector part.
> The latter is done automatically by DCE in the case of a full
> reduction vectorization.
>
> 5. There is no cost model yet, for the reasons mentioned in
> paragraphs 3 and 4.
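(A partial reduction in the sense of point 4 above might look like the
following sketch; the names and the particular mix of terms are purely
illustrative and are not taken from the prototype or the original mail.)

typedef double T;
T psum;

/* Hypothetical partial-reduction case: a[0]..a[3] form a vectorizable
   group, while the trailing scalar terms b and c would still have to be
   added to the partial vector sum, and the scalar adds they feed would
   have to be cleaned up explicitly rather than falling to DCE.  */
void bar (T* __restrict__ a, T b, T c)
{
  psum = a[0] + a[1] + a[2] + a[3] + b + c;
}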
First, sorry for the late response. No, I don't think your approach of
bypassing the "rest" is OK. I once started to implement BB reduction
support but somehow got distracted, IIRC.

Your testcase (and the prototype I worked on) still has a (scalar,
non-grouped) store which can be keyed on in the SLP discovery phase.
You should be able to "re-use" (with a lot of refactoring, I guess) the
reduction finding code (vect_is_slp_reduction) to see whether such a
store is fed by a reduction chain.

Note that if you adjust the testcase to have

  sum[0] = a[0] + ... + a[n];
  sum[1] = b[0] + ... + b[n];

then you'll have a grouped store fed by reductions. You can also consider

  for (i = ...)
    {
      sum[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
    }

which we should be able to handle.

So the whole problem of doing BB reductions boils down to SLP tree
discovery; the rest should be more straightforward (of course, code
generation needs to be adapted for the non-loop case as well). It's not
the easiest problem you are trying to tackle, btw ;)

May I suggest you become familiar with the code by BB vectorizing
vector CONSTRUCTORs instead?

typedef int v4si __attribute__((vector_size(16)));

v4si foo (int *i, int *j)
{
  return (v4si) { i[0] + j[0], i[1] + j[1], i[2] + j[2], i[3] + j[3] };
}

This has the same SLP discovery "issue", this time somewhat easier
because a CONSTRUCTOR directly plays the role of the "grouped store".

Richard.

> Thanks in advance.
>
> --
> Anton
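(For reference, the two testcase shapes suggested in the reply, spelled
out as self-contained C, might look like the sketch below; the element
counts, the double type, the __restrict__ qualifiers and the function
names are illustrative assumptions, not part of the original suggestion.)

typedef double T;
T sum[2];

/* Grouped store into sum[0]/sum[1], each element fed by its own
   reduction chain.  */
void foo2 (T* __restrict__ a, T* __restrict__ b)
{
  sum[0] = a[0] + a[1] + a[2] + a[3];
  sum[1] = b[0] + b[1] + b[2] + b[3];
}

/* Loop whose body is a small per-iteration reduction feeding a store.  */
void bar2 (T* __restrict__ s, T* __restrict__ a, int n)
{
  for (int i = 0; i < n; i++)
    s[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
}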