On 16/09/16 11:05, Bin.Cheng wrote:
On Fri, Sep 16, 2016 at 10:53 AM, Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
On 16/09/16 10:50, Bin.Cheng wrote:
On Fri, Sep 16, 2016 at 10:20 AM, Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
On 16/09/16 10:02, Richard Biener wrote:
On Fri, Sep 16, 2016 at 10:40 AM, Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
Hi all,

Currently the functions:
int f1(int x, int t)
{
     if (x == -1 || x == -2)
       t = 1;
     return t;
}

int f2(int x, int t)
{
     if (x == -1 || x == -2)
       return 1;
     return t;
}

generate different code on AArch64 even though they have identical
functionality:
f1:
           add     w0, w0, 2
           cmp     w0, 1
           csinc   w0, w1, wzr, hi
           ret

f2:
           cmn     w0, #2
           csinc   w0, w1, wzr, cc
           ret

The problem is that f2 performs the comparison (LTU w0 -2) whereas f1
performs (GTU (PLUS w0 2) 1).  I think it is possible to simplify the f1
form to the f2 form with the simplify-rtx.c rule added in this patch.
With this patch the codegen for both f1 and f2 on aarch64 at -O2 is
identical (CMN, CSINC).
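
As a sanity check of the equivalence the rule relies on, here is a small
standalone C test (a sketch for illustration only, not part of the patch):
for 32-bit unsigned x, (x + 2) >u 1 holds exactly when x <u -2.

#include <assert.h>
#include <stdint.h>

/* The form generated for f1: (GTU (PLUS x 2) 1).  */
static int gtu_form (uint32_t x) { return (uint32_t)(x + 2u) > 1u; }

/* The form generated for f2: (LTU x -2).  */
static int ltu_form (uint32_t x) { return x < (uint32_t)-2; }

int main (void)
{
    /* Probe the values around which either comparison flips.  */
    uint32_t probes[] = { 0u, 1u, 2u, 0xFFFFFFFDu, 0xFFFFFFFEu, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof probes / sizeof probes[0]; i++)
      assert (gtu_form (probes[i]) == ltu_form (probes[i]));
    return 0;
}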

Bootstrapped and tested on arm-none-linux-gnueabihf, aarch64-none-linux-gnu, x86_64.
What do you think? Is this a correct generalisation of this issue?
If so, ok for trunk?
Do you see a difference on the GIMPLE level?  If so, this kind of
transform looks appropriate there, too.

The GIMPLE for the two functions looks almost identical:
f1 (intD.7 xD.3078, intD.7 tD.3079)
{
    intD.7 x_4(D) = xD.3078;
    intD.7 t_5(D) = tD.3079;
    unsigned int x.0_1;
    unsigned int _2;
    x.0_1 = (unsigned int) x_4(D);

    _2 = x.0_1 + 2;
    if (_2 <= 1)
      goto <bb 3>;
    else
      goto <bb 4>;
;;   basic block 3, loop depth 0, count 0, freq 3977, maybe hot
;;   basic block 4, loop depth 0, count 0, freq 10000, maybe hot

    # t_3 = PHI <t_5(D)(2), 1(3)>
    return t_3;
}

f2 (intD.7 xD.3082, intD.7 tD.3083)
{
    intD.7 x_4(D) = xD.3082;
    intD.7 t_5(D) = tD.3083;
    unsigned int x.1_1;
    unsigned int _2;
    intD.7 _3;

    x.1_1 = (unsigned int) x_4(D);

    _2 = x.1_1 + 2;
    if (_2 <= 1)
      goto <bb 4>;
    else
      goto <bb 3>;

;;   basic block 3, loop depth 0, count 0, freq 6761, maybe hot
;;   basic block 4, loop depth 0, count 0, freq 10000, maybe hot
    # _3 = PHI <1(2), t_5(D)(3)>
    return _3;

}

So at the GIMPLE level we see (x + 2 <=u 1) in both cases, but with a slightly
different CFG.  RTL-level transformations (ce1) bring it to the pre-combine RTL
where one does (LTU w0 -2) and the other does (GTU (PLUS w0 2) 1).

So the differences start at the RTL level, which is why I think we need this
transformation there.
However, for the testcase:
unsigned int
foo (unsigned int a, unsigned int b)
{
    return (a + 2) > 1;
}

The differences do appear at the GIMPLE level, so I think a match.pd pattern
would help here.
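
Concretely, such a fold would let foo drop the PLUS altogether.  A sketch of
the result I would expect (my illustration with a made-up name, not actual
compiler output):

unsigned int
foo_folded (unsigned int a, unsigned int b)
{
    /* Equivalent to (a + 2) > 1 for unsigned a: false only for
       a == 0xFFFFFFFE and a == 0xFFFFFFFF.  b is unused, as in foo.  */
    return a < -2U;
}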
Hi, may I ask which function this one is different from?

Hi Bin,
I meant to say that the unsigned greater-than comparison is retained at the
GIMPLE level, so it could be optimised there.
In this case, the resulting GIMPLE code refers to a huge unsigned constant.
Whether that constant can be encoded is target dependent.  AArch64 has CMN
to handle it, but I'm not sure about other targets, and even AArch64 only
supports a small range of such constants.  It may be better to leave this
for RTL, where we know better whether the resulting code is optimal.
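
To illustrate (my own sketch; the immediate-encoding details are an
assumption about AArch64, not something established here): the constant
produced by such a fold sits near the top of the unsigned range, where a
compare can use CMN with a small immediate, whereas an arbitrary large
constant would first have to be materialised in a register.

unsigned int near_wrap (unsigned int a)
{
    return a < -2U;          /* CMN w0, #2 can perform this comparison */
}

unsigned int far_from_wrap (unsigned int a)
{
    return a < 0x12345678U;  /* constant likely needs to be built in a register */
}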

Well, we are saving a PLUS operation, so the resulting GIMPLE is simpler IMO,
which is match.pd's goal.

Thanks,
Kyrill

Thanks,
bin
