On Fri, Sep 16, 2016 at 11:07 AM, Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
>
> On 16/09/16 11:05, Bin.Cheng wrote:
>>
>> On Fri, Sep 16, 2016 at 10:53 AM, Kyrill Tkachov
>> <kyrylo.tkac...@foss.arm.com> wrote:
>>>
>>> On 16/09/16 10:50, Bin.Cheng wrote:
>>>>
>>>> On Fri, Sep 16, 2016 at 10:20 AM, Kyrill Tkachov
>>>> <kyrylo.tkac...@foss.arm.com> wrote:
>>>>>
>>>>> On 16/09/16 10:02, Richard Biener wrote:
>>>>>>
>>>>>> On Fri, Sep 16, 2016 at 10:40 AM, Kyrill Tkachov
>>>>>> <kyrylo.tkac...@foss.arm.com> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Currently the functions:
>>>>>>> int f1(int x, int t)
>>>>>>> {
>>>>>>>      if (x == -1 || x == -2)
>>>>>>>        t = 1;
>>>>>>>      return t;
>>>>>>> }
>>>>>>>
>>>>>>> int f2(int x, int t)
>>>>>>> {
>>>>>>>      if (x == -1 || x == -2)
>>>>>>>        return 1;
>>>>>>>      return t;
>>>>>>> }
>>>>>>>
>>>>>>> generate different code on AArch64 even though they have identical
>>>>>>> functionality:
>>>>>>> f1:
>>>>>>>            add     w0, w0, 2
>>>>>>>            cmp     w0, 1
>>>>>>>            csinc   w0, w1, wzr, hi
>>>>>>>            ret
>>>>>>>
>>>>>>> f2:
>>>>>>>            cmn     w0, #2
>>>>>>>            csinc   w0, w1, wzr, cc
>>>>>>>            ret
>>>>>>>
>>>>>>> The problem is that f2 performs the comparison (LTU w0 -2)
>>>>>>> whereas f1 performs (GTU (PLUS w0 2) 1). I think it is possible to
>>>>>>> simplify
>>>>>>> the f1 form
>>>>>>> to the f2 form with the simplify-rtx.c rule added in this patch. With
>>>>>>> this
>>>>>>> patch the
>>>>>>> codegen for both f1 and f2 on aarch64 at -O2 is identical (CMN,
>>>>>>> CSINC).
>>>>>>>
>>>>>>> Bootstrapped and tested on arm-none-linux-gnueabihf,
>>>>>>> aarch64-none-linux-gnu,
>>>>>>> x86_64.
>>>>>>> What do you think? Is this a correct generalisation of this issue?
>>>>>>> If so, ok for trunk?
>>>>>>
>>>>>> Do you see a difference on the GIMPLE level?  If so, this kind of
>>>>>> transform looks
>>>>>> appropriate there, too.
>>>>>
>>>>>
>>>>> The GIMPLE for the two functions looks almost identical:
>>>>> f1 (intD.7 xD.3078, intD.7 tD.3079)
>>>>> {
>>>>>     intD.7 x_4(D) = xD.3078;
>>>>>     intD.7 t_5(D) = tD.3079;
>>>>>     unsigned int x.0_1;
>>>>>     unsigned int _2;
>>>>>     x.0_1 = (unsigned int) x_4(D);
>>>>>
>>>>>     _2 = x.0_1 + 2;
>>>>>     if (_2 <= 1)
>>>>>       goto <bb 3>;
>>>>>     else
>>>>>       goto <bb 4>;
>>>>> ;;   basic block 3, loop depth 0, count 0, freq 3977, maybe hot
>>>>> ;;   basic block 4, loop depth 0, count 0, freq 10000, maybe hot
>>>>>
>>>>>     # t_3 = PHI <t_5(D)(2), 1(3)>
>>>>>     return t_3;
>>>>> }
>>>>>
>>>>> f2 (intD.7 xD.3082, intD.7 tD.3083)
>>>>> {
>>>>>     intD.7 x_4(D) = xD.3082;
>>>>>     intD.7 t_5(D) = tD.3083;
>>>>>     unsigned int x.1_1;
>>>>>     unsigned int _2;
>>>>>     intD.7 _3;
>>>>>
>>>>>     x.1_1 = (unsigned int) x_4(D);
>>>>>
>>>>>     _2 = x.1_1 + 2;
>>>>>     if (_2 <= 1)
>>>>>       goto <bb 4>;
>>>>>     else
>>>>>       goto <bb 3>;
>>>>>
>>>>> ;;   basic block 3, loop depth 0, count 0, freq 6761, maybe hot
>>>>> ;;   basic block 4, loop depth 0, count 0, freq 10000, maybe hot
>>>>>     # _3 = PHI <1(2), t_5(D)(3)>
>>>>>     return _3;
>>>>>
>>>>> }
>>>>>
>>>>> So at GIMPLE level we see a (x + 2 <=u 1) in both cases but with
>>>>> slightly
>>>>> different CFG.  RTL-level transformations (ce1) bring it to the
>>>>> pre-combine
>>>>> RTL
>>>>> where one does (LTU w0 -2) and the other does (GTU (PLUS w0 2) 1).
>>>>>
>>>>> So the differences start at RTL level, so I think we need this
>>>>> transformation there.
>>>>> However, for the testcase:
>>>>> unsigned int
>>>>> foo (unsigned int a, unsigned int b)
>>>>> {
>>>>>     return (a + 2) > 1;
>>>>> }
>>>>>
>>>>> The differences do appear at GIMPLE level, so I think a match.pd
>>>>> pattern
>>>>> would help here.
>>>>
>>>> Hi, may I ask what the function looks like to which this one is
>>>> different
>>>> to?
>>>
>>>
>>> Hi Bin,
>>> I meant to say that the unsigned greater than comparison is retained at
>>> the
>>> GIMPLE level
>>> so could be optimised there.
>>
>> In this case, the resulting gimple code refers to a huge unsigned
>> constant.  It's target dependent if that constant can be encoded.
>> AArch64 has CMN to do that, not sure what other targets' case.  And
>> AArch64 only supports small range of such constants.  May be better to
>> leave it for RTL where we know better if result code is optimal.
>
>
> Well, we are saving a PLUS operation, so the resulting GIMPLE is simpler
Ah, yes, right.

Thanks,
bin

Reply via email to