On Fri, Nov 8, 2024 at 12:34 AM Li, Pan2 <[email protected]> wrote:
>
> Thanks Tamar and Jeff for the comments.
>
> > I'm not sure it's that simple. It'll depend on the micro-architecture.
> > So things like strength of the branch predictors, how fetch blocks are
> > handled (can you have embedded not-taken branches, short-forward-branch
> > optimizations, etc).
>
> > After:
> >
> > .L.sat_add_u_1(unsigned int, unsigned int):
> > add 4,3,4
> > rldicl 9,4,0,32
> > subf 3,3,9
> > sradi 3,3,63
> > or 3,3,4
> > rldicl 3,3,0,32
> > blr
> >
> > and before
> >
> > .L.sat_add_u_1(unsigned int, unsigned int):
> > add 4,3,4
> > cmplw 0,4,3
> > bge 0,.L2
> > li 4,-1
> > .L2:
> > rldicl 3,4,0,32
> > blr
>
> I am not familiar with branch prediction, but the branch should be 50%
> taken and 50% not-taken given the range of the sat add inputs.  Isn't
> that the worst case for branch prediction?  I mean, if we call it 100
> times with a taken, not-taken, taken, not-taken... sequence, will the
> branch version still be faster?
> Feel free to correct me if I'm wrong.
>
> Back to the 16 forms of sat add below, is there any suggestion as to
> which one or two form(s) may be cheaper than the others from the
> perspective of the gimple IR, independent of whether the backend
> implements SAT_ADD or not?
>
> #define DEF_SAT_U_ADD_1(T) \
> T sat_u_add_##T##_1 (T x, T y) \
> { \
> return (T)(x + y) >= x ? (x + y) : -1; \
> }
>
> #define DEF_SAT_U_ADD_2(T) \
> T sat_u_add_##T##_2 (T x, T y) \
> { \
> return (T)(x + y) < x ? -1 : (x + y); \
> }
>
> #define DEF_SAT_U_ADD_3(T) \
> T sat_u_add_##T##_3 (T x, T y) \
> { \
> return x <= (T)(x + y) ? (x + y) : -1; \
> }
>
> #define DEF_SAT_U_ADD_4(T) \
> T sat_u_add_##T##_4 (T x, T y) \
> { \
> return x > (T)(x + y) ? -1 : (x + y); \
> }
>
> #define DEF_SAT_U_ADD_5(T) \
> T sat_u_add_##T##_5 (T x, T y) \
> { \
> if ((T)(x + y) >= x) \
> return x + y; \
> else \
> return -1; \
> }
>
> #define DEF_SAT_U_ADD_6(T) \
> T sat_u_add_##T##_6 (T x, T y) \
> { \
> if ((T)(x + y) < x) \
> return -1; \
> else \
> return x + y; \
> }
>
> #define DEF_SAT_U_ADD_7(T) \
> T sat_u_add_##T##_7 (T x, T y) \
> { \
> if (x <= (T)(x + y)) \
> return x + y; \
> else \
> return -1; \
> }
>
> #define DEF_SAT_U_ADD_8(T) \
> T sat_u_add_##T##_8 (T x, T y) \
> { \
> if (x > (T)(x + y)) \
> return -1; \
> else \
> return x + y; \
> }
>
> #define DEF_SAT_U_ADD_9(T) \
> T sat_u_add_##T##_9 (T x, T y) \
> { \
> T ret; \
> return __builtin_add_overflow (x, y, &ret) == 0 ? ret : -1; \
> }
>
> #define DEF_SAT_U_ADD_10(T) \
> T sat_u_add_##T##_10 (T x, T y) \
> { \
> T ret; \
> return !__builtin_add_overflow (x, y, &ret) ? ret : -1; \
> }
>
> #define DEF_SAT_U_ADD_11(T) \
> T sat_u_add_##T##_11 (T x, T y) \
> { \
> T ret; \
> if (__builtin_add_overflow (x, y, &ret) == 0) \
> return ret; \
> else \
> return -1; \
> }
>
> #define DEF_SAT_U_ADD_12(T) \
> T sat_u_add_##T##_12 (T x, T y) \
> { \
> T ret; \
> if (!__builtin_add_overflow (x, y, &ret)) \
> return ret; \
> else \
> return -1; \
> }
>
> #define DEF_SAT_U_ADD_13(T) \
> T sat_u_add_##T##_13 (T x, T y) \
> { \
> T ret; \
> return __builtin_add_overflow (x, y, &ret) != 0 ? -1 : ret; \
> }
>
> #define DEF_SAT_U_ADD_14(T) \
> T sat_u_add_##T##_14 (T x, T y) \
> { \
> T ret; \
> return __builtin_add_overflow (x, y, &ret) ? -1 : ret; \
> }
>
> #define DEF_SAT_U_ADD_15(T) \
> T sat_u_add_##T##_15 (T x, T y) \
> { \
> T ret; \
> if (__builtin_add_overflow (x, y, &ret) != 0) \
> return -1; \
> else \
> return ret; \
> }
>
> #define DEF_SAT_U_ADD_16(T) \
> T sat_u_add_##T##_16 (T x, T y) \
> { \
> T ret; \
> if (__builtin_add_overflow (x, y, &ret)) \
> return -1; \
> else \
> return ret; \
> }
So for the late .SAT_* introduction in the widen_mul pass the
__builtin_add_overflow variants make sense to be used iff the target
has optabs for those.  The IL for the builtin variant should appear
cheap as well:
  _6 = .ADD_OVERFLOW (x_4(D), y_5(D));
  _2 = IMAGPART_EXPR <_6>;
  if (_2 != 0)
    goto <bb 4>; [21.72%]
  else
    goto <bb 3>; [78.28%]

<bb 3> [local count: 840525096]:
  _1 = REALPART_EXPR <_6>;

<bb 4> [local count: 1073741824]:
  # _3 = PHI <65535(2), _1(3)>
It's possibly also the best canonicalization for the overflow check,
though I fear code generation for targets that do not support
__builtin_*_overflow might end up with two branches.
That said - I'd avoid canonicalizing this via match.pd given that it
will inevitably if-convert.  Instead I'd see it as a way to provide a
generic .SAT_* expansion, though one could say we should then simply
implement fallback expansion for the internal function.  It's also not
necessarily your responsibility to implement this since risc-v does
have .SAT_* expanders, and so does x86.
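
For reference, a minimal C-level sketch of what such a generic fallback
could compute for the unsigned case - not GCC internals, just the
(x + y) | -((x + y) < x) idiom quoted further down in this thread, with
uint32_t and the helper name picked arbitrarily:

/* Hypothetical helper, not GCC internals: a branchless C model of a
   generic unsigned .SAT_ADD fallback.  */
#include <stdint.h>

static inline uint32_t
sat_add_u32_fallback (uint32_t x, uint32_t y)
{
  uint32_t sum = x + y;                   /* wraps on overflow */
  uint32_t mask = -(uint32_t) (sum < x);  /* all-ones iff it wrapped */
  return sum | mask;                      /* saturates to UINT32_MAX */
}

The negate-of-compare builds the saturation mask without a branch, so
the fallback stays straight-line code even on targets without
conditional instructions.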
Richard.
> Pan
>
> -----Original Message-----
> From: Jeff Law <[email protected]>
> Sent: Friday, November 8, 2024 4:08 AM
> To: Tamar Christina <[email protected]>; Li, Pan2 <[email protected]>;
> Richard Biener <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
> SAT_ADD into branchless
>
>
>
> On 11/7/24 8:07 AM, Tamar Christina wrote:
> >
> >> -----Original Message-----
> >> From: Li, Pan2 <[email protected]>
> >> Sent: Thursday, November 7, 2024 12:57 PM
> >> To: Tamar Christina <[email protected]>; Richard Biener
> >> <[email protected]>
> >> Cc: [email protected]; [email protected]; [email protected];
> >> [email protected]; [email protected]
> >> Subject: RE: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
> >> SAT_ADD into branchless
> >>
> >> I see your point that the backend can leverage conditional moves to
> >> emit the branch code.
> >>
> >>> For instance see https://godbolt.org/z/fvrq3aq6K
> >>> On ISAs with conditional operations the branch version gets ifconverted.
> >>> On AArch64 we get:
> >>> sat_add_u_1(unsigned int, unsigned int):
> >>> adds w0, w0, w1
> >>> csinv w0, w0, wzr, cc
> >>> ret
> >>> so just 2 instructions, and also branchless. On x86_64 we get:
> >>> sat_add_u_1(unsigned int, unsigned int):
> >>> add edi, esi
> >>> mov eax, -1
> >>> cmovnc eax, edi
> >>> ret
> >>> so 3 instructions but a dependency chain of 2, also branchless.
> >>> This patch would regress both of these.
> >>
> >> But the above Godbolt may not be good evidence, because both x86_64
> >> and aarch64 have already implemented usadd.
> >> Thus, they all go through usadd<QImode>.  For example, as below,
> >> sat_add_u_1 and sat_add_u_2 are almost the same when the backend
> >> implements usadd.
> >>
> >> #include <stdint.h>
> >>
> >> #define T uint32_t
> >>
> >> T sat_add_u_1 (T x, T y)
> >> {
> >> return (T)(x + y) < x ? -1 : (x + y);
> >> }
> >>
> >> T sat_add_u_2 (T x, T y)
> >> {
> >> return (x + y) | -((x + y) < x);
> >> }
> >>
> >> It becomes different with gcc 14.2 (which doesn't have the .SAT_ADD
> >> GIMPLE IR); the x86_64 asm dump for -O3 is below.  Looks like there
> >> is no obvious difference here.
> >>
> >> sat_add_u_1(unsigned int, unsigned int):
> >> add edi, esi
> >> mov eax, -1
> >> cmovnc eax, edi
> >> ret
> >>
> >> sat_add_u_2(unsigned int, unsigned int):
> >> add edi, esi
> >> sbb eax, eax
> >> or eax, edi
> >> ret
> >>
> >
> > Because CE is able to recognize the idiom and turn it back into a
> > conditional move.
> > Pick a target that doesn't have conditional instructions, like PowerPC
> > https://godbolt.org/z/4bTv18WMv
> >
> > You'll see that this canonicalization has made codegen worse.
> >
> > After:
> >
> > .L.sat_add_u_1(unsigned int, unsigned int):
> > add 4,3,4
> > rldicl 9,4,0,32
> > subf 3,3,9
> > sradi 3,3,63
> > or 3,3,4
> > rldicl 3,3,0,32
> > blr
> >
> > and before
> >
> > .L.sat_add_u_1(unsigned int, unsigned int):
> > add 4,3,4
> > cmplw 0,4,3
> > bge 0,.L2
> > li 4,-1
> > .L2:
> > rldicl 3,4,0,32
> > blr
> >
> > It means it now always has to execute 6 instructions, whereas before
> > it was 4 or 5 depending on whether the branch is taken.  So for those
> > architectures, it's always slower.
> I'm not sure it's that simple. It'll depend on the micro-architecture.
> So things like strength of the branch predictors, how fetch blocks are
> handled (can you have embedded not-taken branches, short-forward-branch
> optimizations, etc).
>
> Jeff