Thanks Tamar and Jeff for the comments.
> I'm not sure it's that simple. It'll depend on the micro-architecture.
> So things like strength of the branch predictors, how fetch blocks are
> handled (can you have embedded not-taken branches, short-forward-branch
> optimizations, etc).
> After:
>
> .L.sat_add_u_1(unsigned int, unsigned int):
> add 4,3,4
> rldicl 9,4,0,32
> subf 3,3,9
> sradi 3,3,63
> or 3,3,4
> rldicl 3,3,0,32
> blr
>
> and before
>
> .L.sat_add_u_1(unsigned int, unsigned int):
> add 4,3,4
> cmplw 0,4,3
> bge 0,.L2
> li 4,-1
> .L2:
> rldicl 3,4,0,32
> blr
I am not very familiar with branch prediction, so feel free to correct me if
I'm wrong, but over the full range of saturating-add inputs the branch should
be roughly 50% taken and 50% not taken, which I would expect to be the worst
case for branch prediction. If we call the function 100 times with a taken,
not-taken, taken, not-taken... sequence, will the branch version still be
faster?
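To make the question concrete, a toy driver like the one below is what I have
in mind. It is only a sketch (no timing, just the calling pattern); the
sat_add_u_1 body is the same ternary form as in the quoted godbolt example,
and the alternating inputs make the overflow check flip on every call.

#include <stdint.h>
#include <stdio.h>

static uint32_t
sat_add_u_1 (uint32_t x, uint32_t y)
{
  return (uint32_t)(x + y) < x ? (uint32_t)-1 : (x + y);
}

int
main (void)
{
  volatile uint32_t sink = 0;

  for (int i = 0; i < 100; i++)
    {
      /* Odd iterations overflow (saturate), even ones do not, so the
         overflow check alternates taken/not-taken on every call.  */
      uint32_t x = (i & 1) ? UINT32_MAX : 1u;
      sink += sat_add_u_1 (x, 2u);
    }

  printf ("%u\n", (unsigned) sink);
  return 0;
}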
Back to the 16 forms of saturating add below: is there any suggestion as to
which one or two forms may be cheaper than the others from the perspective of
the gimple IR, independent of whether the backend implements SAT_ADD or not?
#define DEF_SAT_U_ADD_1(T) \
T sat_u_add_##T##_1 (T x, T y) \
{ \
  return (T)(x + y) >= x ? (x + y) : -1; \
}
#define DEF_SAT_U_ADD_2(T) \
T sat_u_add_##T##_2 (T x, T y) \
{ \
  return (T)(x + y) < x ? -1 : (x + y); \
}
#define DEF_SAT_U_ADD_3(T) \
T sat_u_add_##T##_3 (T x, T y) \
{ \
  return x <= (T)(x + y) ? (x + y) : -1; \
}
#define DEF_SAT_U_ADD_4(T) \
T sat_u_add_##T##_4 (T x, T y) \
{ \
  return x > (T)(x + y) ? -1 : (x + y); \
}
#define DEF_SAT_U_ADD_5(T) \
T sat_u_add_##T##_5 (T x, T y) \
{ \
  if ((T)(x + y) >= x) \
    return x + y; \
  else \
    return -1; \
}
#define DEF_SAT_U_ADD_6(T) \
T sat_u_add_##T##_6 (T x, T y) \
{ \
  if ((T)(x + y) < x) \
    return -1; \
  else \
    return x + y; \
}
#define DEF_SAT_U_ADD_7(T) \
T sat_u_add_##T##_7 (T x, T y) \
{ \
  if (x <= (T)(x + y)) \
    return x + y; \
  else \
    return -1; \
}
#define DEF_SAT_U_ADD_8(T) \
T sat_u_add_##T##_8 (T x, T y) \
{ \
  if (x > (T)(x + y)) \
    return -1; \
  else \
    return x + y; \
}
#define DEF_SAT_U_ADD_9(T) \
T sat_u_add_##T##_9 (T x, T y) \
{ \
  T ret; \
  return __builtin_add_overflow (x, y, &ret) == 0 ? ret : -1; \
}
#define DEF_SAT_U_ADD_10(T) \
T sat_u_add_##T##_10 (T x, T y) \
{ \
  T ret; \
  return !__builtin_add_overflow (x, y, &ret) ? ret : -1; \
}
#define DEF_SAT_U_ADD_11(T) \
T sat_u_add_##T##_11 (T x, T y) \
{ \
  T ret; \
  if (__builtin_add_overflow (x, y, &ret) == 0) \
    return ret; \
  else \
    return -1; \
}
#define DEF_SAT_U_ADD_12(T) \
T sat_u_add_##T##_12 (T x, T y) \
{ \
  T ret; \
  if (!__builtin_add_overflow (x, y, &ret)) \
    return ret; \
  else \
    return -1; \
}
#define DEF_SAT_U_ADD_13(T) \
T sat_u_add_##T##_13 (T x, T y) \
{ \
  T ret; \
  return __builtin_add_overflow (x, y, &ret) != 0 ? -1 : ret; \
}
#define DEF_SAT_U_ADD_14(T) \
T sat_u_add_##T##_14 (T x, T y) \
{ \
  T ret; \
  return __builtin_add_overflow (x, y, &ret) ? -1 : ret; \
}
#define DEF_SAT_U_ADD_15(T) \
T sat_u_add_##T##_15 (T x, T y) \
{ \
  T ret; \
  if (__builtin_add_overflow (x, y, &ret) != 0) \
    return -1; \
  else \
    return ret; \
}
#define DEF_SAT_U_ADD_16(T) \
T sat_u_add_##T##_16 (T x, T y) \
{ \
  T ret; \
  if (__builtin_add_overflow (x, y, &ret)) \
    return -1; \
  else \
    return ret; \
}
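For reference, each form above is meant to be instantiated per type, for
example (uint32_t is only used for illustration here):

#include <stdint.h>

/* Given the DEF_SAT_U_ADD_1 macro above, this expands to
   uint32_t sat_u_add_uint32_t_1 (uint32_t x, uint32_t y).  */
DEF_SAT_U_ADD_1 (uint32_t)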
Pan
-----Original Message-----
From: Jeff Law <[email protected]>
Sent: Friday, November 8, 2024 4:08 AM
To: Tamar Christina <[email protected]>; Li, Pan2 <[email protected]>;
Richard Biener <[email protected]>
Cc: [email protected]; [email protected]; [email protected];
[email protected]
Subject: Re: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned SAT_ADD
into branchless
On 11/7/24 8:07 AM, Tamar Christina wrote:
>
>> -----Original Message-----
>> From: Li, Pan2 <[email protected]>
>> Sent: Thursday, November 7, 2024 12:57 PM
>> To: Tamar Christina <[email protected]>; Richard Biener
>> <[email protected]>
>> Cc: [email protected]; [email protected]; [email protected];
>> [email protected]; [email protected]
>> Subject: RE: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
>> SAT_ADD into branchless
>>
>> I see your point that the backend can leverage condition move to emit the
>> branch
>> code.
>>
>>> For instance see https://godbolt.org/z/fvrq3aq6K
>>> On ISAs with conditional operations the branch version gets ifconverted.
>>> On AArch64 we get:
>>> sat_add_u_1(unsigned int, unsigned int):
>>> adds w0, w0, w1
>>> csinv w0, w0, wzr, cc
>>> ret
>>> so just 2 instructions, and also branchless. On x86_64 we get:
>>> sat_add_u_1(unsigned int, unsigned int):
>>> add edi, esi
>>> mov eax, -1
>>> cmovnc eax, edi
>>> ret
>>> so 3 instructions but a dependency chain of 2.
>>> also branchless. This patch would regress both of these.
>>
>> But the above Godbolt may not be a good example for evidence, because both
>> x86_64 and aarch64 implemented usadd
>> already.
>> Thus, they all go to usadd<QImode>. For example as below, the sat_add_u_1 and
>> sat_add_u_2 are almost the
>> same when the backend implemented usadd.
>>
>> #include <stdint.h>
>>
>> #define T uint32_t
>>
>> T sat_add_u_1 (T x, T y)
>> {
>> return (T)(x + y) < x ? -1 : (x + y);
>> }
>>
>> T sat_add_u_2 (T x, T y)
>> {
>> return (x + y) | -((x + y) < x);
>> }
>>
>> It will become different when take gcc 14.2 (which doesn’t have .SAT_ADD
>> GIMPLE
>> IR), the x86_64
>> will have below asm dump for -O3. Looks like no obvious difference here.
>>
>> sat_add_u_1(unsigned int, unsigned int):
>> add edi, esi
>> mov eax, -1
>> cmovnc eax, edi
>> ret
>>
>> sat_add_u_2(unsigned int, unsigned int):
>> add edi, esi
>> sbb eax, eax
>> or eax, edi
>> ret
>>
>
> Because CE is able to recognize the idiom back into a conditional move.
> Pick a target that doesn't have conditional instructions, like PowerPC
> https://godbolt.org/z/4bTv18WMv
>
> You'll see that this canonicalization has made codegen worse.
>
> After:
>
> .L.sat_add_u_1(unsigned int, unsigned int):
> add 4,3,4
> rldicl 9,4,0,32
> subf 3,3,9
> sradi 3,3,63
> or 3,3,4
> rldicl 3,3,0,32
> blr
>
> and before
>
> .L.sat_add_u_1(unsigned int, unsigned int):
> add 4,3,4
> cmplw 0,4,3
> bge 0,.L2
> li 4,-1
> .L2:
> rldicl 3,4,0,32
> blr
>
> It means now it always has to execute 6 instructions, whereas before it was 4
> or 5 depending
> on the order of the branch. So for those architectures, it's always slower.
I'm not sure it's that simple. It'll depend on the micro-architecture.
So things like strength of the branch predictors, how fetch blocks are
handled (can you have embedded not-taken branches, short-forward-branch
optimizations, etc).
Jeff