On Thu, Oct 20, 2016 at 6:32 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Wed, Oct 12, 2016 at 9:50 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Wed, Oct 12, 2016 at 10:29 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Wed, Oct 12, 2016 at 9:12 AM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Tue, Oct 11, 2016 at 5:03 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> Given below test case,
>>>>> int foo (unsigned short a[], unsigned int x)
>>>>> {
>>>>>   unsigned int i;
>>>>>   for (i = 0; i < 1000; i++)
>>>>>     {
>>>>>       x = a[i];
>>>>>       a[i] = (unsigned short)(x >= 32768 ? x - 32768 : 0);
>>>>>     }
>>>>>   return x;
>>>>> }
>>>>>
>>>>> it now can be vectorized on AArch64, but generated assembly is way from
>>>>> optimal:
>>>>> .L4:
>>>>>         ldr     q4, [x3, x1]
>>>>>         add     w2, w2, 1
>>>>>         cmp     w2, w0
>>>>>         ushll   v1.4s, v4.4h, 0
>>>>>         ushll2  v0.4s, v4.8h, 0
>>>>>         add     v3.4s, v1.4s, v6.4s
>>>>>         add     v2.4s, v0.4s, v6.4s
>>>>>         cmhi    v1.4s, v1.4s, v5.4s
>>>>>         cmhi    v0.4s, v0.4s, v5.4s
>>>>>         and     v1.16b, v3.16b, v1.16b
>>>>>         and     v0.16b, v2.16b, v0.16b
>>>>>         xtn     v2.4h, v1.4s
>>>>>         xtn2    v2.8h, v0.4s
>>>>>         str     q2, [x3, x1]
>>>>>         add     x1, x1, 16
>>>>>         bcc     .L4
>>>>>
>>>>> The vectorized loop has 15 instructions, which can be greatly simplified
>>>>> by turning cond_expr into max_expr, as below:
>>>>> .L4:
>>>>>         ldr     q1, [x3, x1]
>>>>>         add     w2, w2, 1
>>>>>         cmp     w2, w0
>>>>>         umax    v0.8h, v1.8h, v2.8h
>>>>>         add     v0.8h, v0.8h, v2.8h
>>>>>         str     q0, [x3, x1]
>>>>>         add     x1, x1, 16
>>>>>         bcc     .L4
>>>>>
>>>>> This patch addresses the issue by adding new vectorization pattern.
>>>>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>>>
>>>> So the COND_EXPRs are generated this way by if-conversion, right?  I
>>> Though ?: is used in source code, yes, it is if-conv regenerated COND_EXPR.
>>>> believe that the MAX/MIN_EXPR form is always preferable and thus it
>>>> looks like if-conversion might want to either directly generate it or
>>>> make sure to fold the introduced stmts (and have a match.pd pattern
>>>> catching this).
>>> Hmm, I also noticed saturation cases which should be better
>>> transformed before vectorization in scalar optimizers.  But this case
>>> is a bit different because there is additional computation involved
>>> other than type conversion.  We need to prove the computation can be
>>> done in either large or small types.  It is quite a specific case and I
>>> don't see a good (general) solution in if-conv.  Vect-pattern looks like
>>> a natural place for doing this.  I am also looking at general saturation
>>> cases, but this one is different?
>>
>> (vect-patterns should go away ...)
>>
>> But as if-conversion results may also prevail for scalar code, doing the
>> pattern in match.pd would be better - that is, "apply" the pattern
>> already during if-conversion.
>>
>> Yes, if-conversion fails to fold the stmts it generates (it only uses
>> generic folding on the trees it builds - it can need some TLC here).
> Hi,
> Sorry for being slow in replying.  I looked into match.pd and can
> transform simpler cond_expr into minmax expr successfully, but this
> one is more complicated.  It transforms 3 gimple statements into 2
> result statements, but the result of a match&simplify pattern is an
> expression.  How should I write the pattern outputting two gimple
> statements as result?  Hmm, now I see the transform looks more like
> gimple combine...

You simply write a larger expression.  match&simplify happily creates
two gimple stmts from a (plus (minus @1 @2) @3) result.

Richard.

> Thanks,
> bin
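
For reference, the scalar identity behind the shorter umax/add sequence quoted above can be checked exhaustively. The sketch below is illustrative only and is not taken from the patch: the function names cond_form, max_form and check_equivalence are made up here, and the code says nothing about how the match.pd pattern itself would be written. It only shows that, in 16-bit unsigned arithmetic, adding 32768 wraps to the same result as subtracting 32768, so the computation can be done in the narrow type.

```c
#include <assert.h>
#include <stdint.h>

/* Form from the test case: widen to 32 bits, compare, subtract, narrow. */
uint16_t cond_form(uint16_t v)
{
    uint32_t x = v;
    return (uint16_t)(x >= 32768u ? x - 32768u : 0u);
}

/* Narrow form (hypothetical helper): unsigned max, then a 16-bit add.
   Adding 32768 modulo 65536 gives the same result as subtracting 32768. */
uint16_t max_form(uint16_t v)
{
    uint16_t m = v > 32768u ? v : (uint16_t)32768u;  /* like umax */
    return (uint16_t)(m + 32768u);                   /* like add, wraps */
}

/* Compare both forms on every possible 16-bit input. */
void check_equivalence(void)
{
    for (uint32_t v = 0; v <= UINT16_MAX; v++)
        assert(cond_form((uint16_t)v) == max_form((uint16_t)v));
}
```

Calling check_equivalence() from a small main (built without NDEBUG) runs through all 65536 inputs. Inputs below 32768 clamp to 32768 and wrap to 0, matching the else arm of the conditional.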