On Fri, Sep 16, 2016 at 11:07 AM, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote:
>
> On 16/09/16 11:05, Bin.Cheng wrote:
>>
>> On Fri, Sep 16, 2016 at 10:53 AM, Kyrill Tkachov
>> <kyrylo.tkac...@foss.arm.com> wrote:
>>>
>>> On 16/09/16 10:50, Bin.Cheng wrote:
>>>>
>>>> On Fri, Sep 16, 2016 at 10:20 AM, Kyrill Tkachov
>>>> <kyrylo.tkac...@foss.arm.com> wrote:
>>>>>
>>>>> On 16/09/16 10:02, Richard Biener wrote:
>>>>>>
>>>>>> On Fri, Sep 16, 2016 at 10:40 AM, Kyrill Tkachov
>>>>>> <kyrylo.tkac...@foss.arm.com> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Currently the functions:
>>>>>>>
>>>>>>> int f1(int x, int t)
>>>>>>> {
>>>>>>>   if (x == -1 || x == -2)
>>>>>>>     t = 1;
>>>>>>>   return t;
>>>>>>> }
>>>>>>>
>>>>>>> int f2(int x, int t)
>>>>>>> {
>>>>>>>   if (x == -1 || x == -2)
>>>>>>>     return 1;
>>>>>>>   return t;
>>>>>>> }
>>>>>>>
>>>>>>> generate different code on AArch64 even though they have identical
>>>>>>> functionality:
>>>>>>>
>>>>>>> f1:
>>>>>>>         add     w0, w0, 2
>>>>>>>         cmp     w0, 1
>>>>>>>         csinc   w0, w1, wzr, hi
>>>>>>>         ret
>>>>>>>
>>>>>>> f2:
>>>>>>>         cmn     w0, #2
>>>>>>>         csinc   w0, w1, wzr, cc
>>>>>>>         ret
>>>>>>>
>>>>>>> The problem is that f2 performs the comparison (LTU w0 -2)
>>>>>>> whereas f1 performs (GTU (PLUS w0 2) 1).  I think it is possible to
>>>>>>> simplify the f1 form to the f2 form with the simplify-rtx.c rule
>>>>>>> added in this patch.  With this patch the codegen for both f1 and
>>>>>>> f2 on aarch64 at -O2 is identical (CMN, CSINC).
>>>>>>>
>>>>>>> Bootstrapped and tested on arm-none-linux-gnueabihf,
>>>>>>> aarch64-none-linux-gnu, x86_64.
>>>>>>> What do you think?  Is this a correct generalisation of this issue?
>>>>>>> If so, ok for trunk?
>>>>>>
>>>>>> Do you see a difference on the GIMPLE level?  If so, this kind of
>>>>>> transform looks appropriate there, too.
>>>>>
>>>>>
>>>>> The GIMPLE for the two functions looks almost identical:
>>>>>
>>>>> f1 (intD.7 xD.3078, intD.7 tD.3079)
>>>>> {
>>>>>   intD.7 x_4(D) = xD.3078;
>>>>>   intD.7 t_5(D) = tD.3079;
>>>>>   unsigned int x.0_1;
>>>>>   unsigned int _2;
>>>>>
>>>>>   x.0_1 = (unsigned int) x_4(D);
>>>>>   _2 = x.0_1 + 2;
>>>>>   if (_2 <= 1)
>>>>>     goto <bb 3>;
>>>>>   else
>>>>>     goto <bb 4>;
>>>>>
>>>>>   ;; basic block 3, loop depth 0, count 0, freq 3977, maybe hot
>>>>>   ;; basic block 4, loop depth 0, count 0, freq 10000, maybe hot
>>>>>
>>>>>   # t_3 = PHI <t_5(D)(2), 1(3)>
>>>>>   return t_3;
>>>>> }
>>>>>
>>>>> f2 (intD.7 xD.3082, intD.7 tD.3083)
>>>>> {
>>>>>   intD.7 x_4(D) = xD.3082;
>>>>>   intD.7 t_5(D) = tD.3083;
>>>>>   unsigned int x.1_1;
>>>>>   unsigned int _2;
>>>>>   intD.7 _3;
>>>>>
>>>>>   x.1_1 = (unsigned int) x_4(D);
>>>>>   _2 = x.1_1 + 2;
>>>>>   if (_2 <= 1)
>>>>>     goto <bb 4>;
>>>>>   else
>>>>>     goto <bb 3>;
>>>>>
>>>>>   ;; basic block 3, loop depth 0, count 0, freq 6761, maybe hot
>>>>>   ;; basic block 4, loop depth 0, count 0, freq 10000, maybe hot
>>>>>
>>>>>   # _3 = PHI <1(2), t_5(D)(3)>
>>>>>   return _3;
>>>>> }
>>>>>
>>>>> So at GIMPLE level we see a (x + 2 <=u 1) in both cases but with
>>>>> slightly different CFG.  RTL-level transformations (ce1) bring it to
>>>>> the pre-combine RTL, where one does (LTU w0 -2) and the other does
>>>>> (GTU (PLUS w0 2) 1).
>>>>>
>>>>> So the differences start at RTL level, so I think we need this
>>>>> transformation there.
>>>>>
>>>>> However, for the testcase:
>>>>>
>>>>> unsigned int
>>>>> foo (unsigned int a, unsigned int b)
>>>>> {
>>>>>   return (a + 2) > 1;
>>>>> }
>>>>>
>>>>> the differences do appear at the GIMPLE level, so I think a match.pd
>>>>> pattern would help here.
>>>>
>>>> Hi, may I ask what the function looks like that this one is
>>>> different from?
>>>
>>>
>>> Hi Bin,
>>> I meant to say that the unsigned greater-than comparison is retained
>>> at the GIMPLE level, so it could be optimised there.
>>
>> In this case, the resulting GIMPLE code refers to a huge unsigned
>> constant.  Whether that constant can be encoded is target dependent.
>> AArch64 has CMN to handle that; I'm not sure about other targets.  And
>> AArch64 only supports a small range of such constants.  It may be
>> better to leave this to RTL, where we know better whether the
>> resulting code is optimal.
>
>
> Well, we are saving a PLUS operation, so the resulting GIMPLE is simpler.

Ah, yes, right.
Thanks, bin
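
As a side note, the equivalence the proposed simplify-rtx.c rule relies on,
(GTU (PLUS x 2) 1) == (LTU x -2) for 32-bit unsigned x, can be spot-checked
with a small standalone C program.  The sketch below is illustrative only
and not part of the patch; the function names are made up:

/* Spot-check that (x + 2) >u 1 and x <u -2 agree, i.e. that both forms
   are false exactly for x in { -1, -2 } and true otherwise.
   Illustrative only; not part of the patch.  */
#include <assert.h>
#include <stdint.h>

/* The f1 shape: (GTU (PLUS x 2) 1).  */
static int
gtu_plus_form (uint32_t x)
{
  return (uint32_t) (x + 2) > 1;
}

/* The f2 shape: (LTU x -2).  */
static int
ltu_form (uint32_t x)
{
  return x < (uint32_t) -2;
}

int
main (void)
{
  static const uint32_t probes[]
    = { 0, 1, 2, 0x7fffffff, 0x80000000,
        0xfffffffd, 0xfffffffe, 0xffffffff };
  for (unsigned int i = 0; i < sizeof probes / sizeof probes[0]; i++)
    assert (gtu_plus_form (probes[i]) == ltu_form (probes[i]));
  return 0;
}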
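
Similarly, for the foo testcase, a match.pd-style simplification would
presumably turn (a + 2) > 1 into a single unsigned comparison against -2U
(0xfffffffe).  The sketch below only shows the source-level shape under
that assumption; the large constant it exposes is the target-dependent
encoding issue (e.g. what AArch64 can fold into CMN) raised above:

/* Hypothetical source-level rendering of the simplified foo: the PLUS
   is gone, but the comparison now uses the large constant 0xfffffffe,
   whose encoding as an immediate is target dependent.  */
unsigned int
foo_simplified (unsigned int a, unsigned int b)
{
  (void) b;          /* b is unused, as in the original foo.  */
  return a < -2u;    /* Equivalent to (a + 2) > 1 for unsigned a.  */
}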