Thank you for your detailed explanation. Once GCC detects a reduction
operation, it automatically accumulates all elements of the vector after the
loop. Inside the loop, the reduction variable is always a vector whose
elements hold partial reductions of the corresponding elements of the other
vectors.
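(For reference, the kind of scalar loop this SAD pattern is meant to catch
looks roughly like the sketch below; the array names, types and trip count
are only illustrative and not taken from the patch.)

    /* Illustrative sketch only: a scalar sum-of-absolute-differences
       reduction over narrow inputs with a wider accumulator.  */
    #include <stdlib.h>

    unsigned char a[256], b[256];

    int
    sum_abs_diff (void)
    {
      int sum = 0;
      for (int i = 0; i < 256; i++)
        sum += abs (a[i] - b[i]);   /* absolute difference accumulated into 'sum' */
      return sum;
    }

After vectorization, 'sum' becomes a vector of per-lane partial sums inside
the loop, and those lanes are added together only once, in the reduction
epilogue after the loop.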
Therefore in your case the only instruction you need to generate is:

    VABAL ops[3], ops[1], ops[2]

It is OK if you accumulate the elements into one element of the vector inside
the loop (if one instruction can do this), but you have to make sure the other
elements in the vector remain zero so that the final result is correct.

If you are confused by the documentation, check the entry for udot_prod (just
above usad in md.texi), as its behavior is very similar to usad's. Actually I
copied the text from there and made some changes. As those two instruction
patterns are both for vectorization, their behavior should not be difficult to
explain. If you have more questions, or think that the documentation is still
unclear, please let me know. Thank you very much!

Cong


On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh
<james.greenha...@arm.com> wrote:
> On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote:
>> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
>> <james.greenha...@arm.com> wrote:
>> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
>> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>> >> index 2a5a2e1..8f5d39a 100644
>> >> --- a/gcc/doc/md.texi
>> >> +++ b/gcc/doc/md.texi
>> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or
>> >>  wider than the mode of the product. The result is placed in operand 0, which
>> >>  is of the same mode as operand 3.
>> >>
>> >> +@cindex @code{ssad@var{m}} instruction pattern
>> >> +@item @samp{ssad@var{m}}
>> >> +@cindex @code{usad@var{m}} instruction pattern
>> >> +@item @samp{usad@var{m}}
>> >> +Compute the sum of absolute differences of two signed/unsigned elements.
>> >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, which
>> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
>> >> +equal or wider than the mode of the absolute difference. The result is placed
>> >> +in operand 0, which is of the same mode as operand 3.
>> >> +
>> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>> >>  @item @samp{ssum_widen@var{m3}}
>> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
>> >> diff --git a/gcc/expr.c b/gcc/expr.c
>> >> index 4975a64..1db8a49 100644
>> >
>> > I'm not sure I follow, and if I do - I don't think it matches what
>> > you have implemented for i386.
>> >
>> > From your text description I would guess the series of operations to be:
>> >
>> >   v1 = widen (operands[1])
>> >   v2 = widen (operands[2])
>> >   v3 = abs (v1 - v2)
>> >   operands[0] = v3 + operands[3]
>> >
>> > But if I understand the behaviour of PSADBW correctly, what you have
>> > actually implemented is:
>> >
>> >   v1 = widen (operands[1])
>> >   v2 = widen (operands[2])
>> >   v3 = abs (v1 - v2)
>> >   v4 = reduce_plus (v3)
>> >   operands[0] = v4 + operands[3]
>> >
>> > To my mind, synthesizing the reduce_plus step will be wasteful for targets
>> > which do not get this for free with their Absolute Difference step. Imagine
>> > a simple loop where we have synthesized the reduce_plus: we compute partial
>> > sums each loop iteration, though we would be better off leaving the
>> > reduce_plus step until after the loop. "REDUC_PLUS_EXPR" would be the
>> > appropriate Tree code for this.
>>
>> What do you mean when you use "synthesizing" here? For each pattern,
>> the only synthesized operation is the one being returned from the
>> pattern recognizer. In this case, it is USAD_EXPR.
>> The recognition of reduce sum is necessary as we need corresponding
>> prolog and epilog for reductions, which is already done before pattern
>> recognition. Note that reduction is not a pattern but is a type of vector
>> definition. A vectorization pattern can still be a reduction operation as
>> long as STMT_VINFO_RELATED_STMT of this pattern is a reduction operation.
>> You can check the other two reduction patterns: widen_sum_pattern and
>> dot_prod_pattern for reference.
>
> My apologies for not being clear. What I mean is, for a target which does
> not have a dedicated PSADBW instruction, the individual steps of
> 'usad<m>' must be "synthesized" in such a way as to match the expected
> behaviour of the tree code.
>
> So, I must expand 'usadm' to a series of equivalent instructions
> as USAD_EXPR expects.
>
> If USAD_EXPR requires me to emit a reduction on each loop iteration,
> I think that will be inefficient compared to performing the reduction
> after the loop body.
>
> To a first approximation on ARM, I would expect from your description
> of 'usad<m>' that generating,
>
>   VABAL ops[3], ops[1], ops[2]
>   (Vector widening Absolute Difference and Accumulate)
>
> would fulfil the requirements.
>
> But to match the behaviour you have implemented in the i386
> backend I would be required to generate:
>
>   VABAL ops[3], ops[1], ops[2]
>   VPADD ops[3], ops[3], ops[3]   (add one set of pairs)
>   VPADD ops[3], ops[3], ops[3]   (and the other)
>   VAND  ops[0], ops[3], MASK     (clear high lanes)
>
> Which additionally performs the (redundant) vector reduction
> and high lane zeroing step on each loop iteration.
>
> My comment is that your documentation and implementation are
> inconsistent so I am not sure which behaviour you intend for USAD_EXPR.
>
> Additionally, I think it would be more generic to choose the first
> behaviour, rather than requiring a wasteful decomposition to match
> a very particular i386 opcode.
>
> Thanks,
> James
>
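In scalar terms, the difference between the two behaviours discussed above is
roughly the following (a rough sketch only, assuming a 4-lane vector; neither
function is taken from the patch or from either backend):

    #include <stdlib.h>

    #define LANES 4   /* assumed vector width, for illustration only */

    /* First behaviour: keep one partial sum per lane inside the loop and add
       the lanes together once, after the loop, in the reduction epilogue.  */
    unsigned int
    sad_reduce_after_loop (const unsigned char *a, const unsigned char *b, int n)
    {
      unsigned int acc[LANES] = { 0, 0, 0, 0 };
      for (int i = 0; i < n; i += LANES)          /* n assumed a multiple of LANES */
        for (int l = 0; l < LANES; l++)
          acc[l] += abs (a[i + l] - b[i + l]);    /* per-lane absolute difference */
      return acc[0] + acc[1] + acc[2] + acc[3];   /* single reduction after the loop */
    }

    /* Second (PSADBW-style) behaviour: reduce the absolute differences into
       lane 0 on every iteration; the remaining lanes stay zero, so the result
       is still correct after the generic reduction epilogue.  */
    unsigned int
    sad_reduce_in_loop (const unsigned char *a, const unsigned char *b, int n)
    {
      unsigned int acc[LANES] = { 0, 0, 0, 0 };
      for (int i = 0; i < n; i += LANES)
        {
          unsigned int diff_sum = 0;
          for (int l = 0; l < LANES; l++)
            diff_sum += abs (a[i + l] - b[i + l]);
          acc[0] += diff_sum;                     /* reduction happens inside the loop */
        }
      return acc[0] + acc[1] + acc[2] + acc[3];   /* lanes 1..3 are still zero */
    }

With the first behaviour the per-iteration work is only the widening
absolute-difference accumulation, which is why a single accumulate instruction
per iteration can be enough on targets without a PSADBW-like instruction.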