Re: [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus

Richard Biener Fri, 21 Oct 2016 00:41:28 -0700

On Thu, Oct 20, 2016 at 6:32 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Wed, Oct 12, 2016 at 9:50 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Wed, Oct 12, 2016 at 10:29 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Wed, Oct 12, 2016 at 9:12 AM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Tue, Oct 11, 2016 at 5:03 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> Given the test case below,
>>>>> int foo (unsigned short a[], unsigned int x)
>>>>> {
>>>>>   unsigned int i;
>>>>>   for (i = 0; i < 1000; i++)
>>>>>     {
>>>>>       x = a[i];
>>>>>       a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
>>>>>     }
>>>>>   return x;
>>>>> }
>>>>>
>>>>> it can now be vectorized on AArch64, but the generated assembly is far
>>>>> from optimal:
>>>>> .L4:
>>>>>         ldr     q4, [x3, x1]
>>>>>         add     w2, w2, 1
>>>>>         cmp     w2, w0
>>>>>         ushll   v1.4s, v4.4h, 0
>>>>>         ushll2  v0.4s, v4.8h, 0
>>>>>         add     v3.4s, v1.4s, v6.4s
>>>>>         add     v2.4s, v0.4s, v6.4s
>>>>>         cmhi    v1.4s, v1.4s, v5.4s
>>>>>         cmhi    v0.4s, v0.4s, v5.4s
>>>>>         and     v1.16b, v3.16b, v1.16b
>>>>>         and     v0.16b, v2.16b, v0.16b
>>>>>         xtn     v2.4h, v1.4s
>>>>>         xtn2    v2.8h, v0.4s
>>>>>         str     q2, [x3, x1]
>>>>>         add     x1, x1, 16
>>>>>         bcc     .L4
>>>>>
>>>>> The vectorized loop has 15 instructions, which can be greatly simplified 
>>>>> by turning cond_expr into max_expr, as below:
>>>>> .L4:
>>>>>         ldr     q1, [x3, x1]
>>>>>         add     w2, w2, 1
>>>>>         cmp     w2, w0
>>>>>         umax    v0.8h, v1.8h, v2.8h
>>>>>         add     v0.8h, v0.8h, v2.8h
>>>>>         str     q0, [x3, x1]
>>>>>         add     x1, x1, 16
>>>>>         bcc     .L4
>>>>>
>>>>> This patch addresses the issue by adding a new vectorization pattern.
>>>>> Bootstrapped and tested on x86_64 and AArch64.  Is it OK?
>>>>
>>>> So the COND_EXPRs are generated this way by if-conversion, right?  I
>>> Though ?: is used in the source code, yes, it is a COND_EXPR regenerated by if-conv.
>>>> believe that the MAX/MIN_EXPR form is always preferable and thus it looks
>>>> like if-conversion might want to either directly generate it or make sure
>>>> to fold the introduced stmts (and have a match.pd pattern catching this).
>>> Hmm, I also noticed saturation cases which would be better
>>> transformed before vectorization, in the scalar optimizers.  But this case
>>> is a bit different because there is additional computation involved
>>> other than type conversion.  We need to prove the computation can be
>>> done in either the large or the small type.  It is quite a specific case
>>> and I don't see a good (general) solution in if-conv.  A vect-pattern looks
>>> like a natural place to do this.  I am also looking at general saturation
>>> cases, but this one is different?
>>
>> (vect-patterns should go away ...)
>>
>> But as if-conversion results may also prevail for scalar code, doing the
>> pattern in match.pd would be better - that is, "apply" the pattern
>> already during if-conversion.
>>
>> Yes, if-conversion fails to fold the stmts it generates (it only uses
>> generic folding on the trees it builds - it could use some TLC here).
> Hi,
> Sorry for being slow in replying.  I looked into match.pd and can
> transform simpler cond_exprs into min/max exprs successfully, but this
> one is more complicated.  It transforms 3 gimple statements into 2
> result statements, but the result of a match&simplify pattern is an
> expression.  How should I write a pattern outputting two gimple
> statements as its result?  Hmm, now I see the transform looks more like
> gimple combine...


You simply write a larger expression.  match&simplify happily
creates two gimple stmts from a (plus (minus @1 @2) @3) result.

Richard.

> Thanks,
> bin
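
For reference, the scalar identity this thread relies on (the widened
"x >= 32768 ? x - 32768 : 0" equals a 16-bit max followed by a wrapping
16-bit add of 32768) can be checked exhaustively.  The sketch below is only
an illustration and not part of the patch; the names cond_form, max_form
and count_mismatches are hypothetical, and it assumes int is wider than
16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Original form from the test case: widen to unsigned int,
   select, then truncate back to 16 bits.  */
static uint16_t cond_form(uint16_t a)
{
    unsigned int x = a;
    return (uint16_t)(x >= 32768 ? x - 32768 : 0);
}

/* Narrow form mirroring the umax + add sequence on 16-bit lanes.
   Adding 32768 modulo 2^16 is the same as subtracting 32768.  */
static uint16_t max_form(uint16_t a)
{
    uint16_t m = a > 32768 ? a : 32768;   /* unsigned max */
    return (uint16_t)(m + 32768);         /* truncation wraps modulo 2^16 */
}

/* Compare both forms over every 16-bit input and
   return the number of inputs where they differ.  */
static unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned v = 0; v <= 0xFFFFu; v++)
        if (cond_form((uint16_t)v) != max_form((uint16_t)v))
            bad++;
    return bad;
}
```

Calling count_mismatches() from a small main should return 0, which is
the "computation can be done in the small type" property discussed above
for this particular constant.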

Re: [PATCH GCC]New vectorization pattern turning cond_expr into max/min and plus/minus
