> -----Original Message-----
> From: Richard Biener <richard.guent...@gmail.com>
> Sent: Monday, October 9, 2023 10:45 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Richard Sandiford <richard.sandif...@arm.com>; gcc-patc...@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>
> On Mon, Oct 9, 2023 at 11:39 AM Tamar Christina <tamar.christ...@arm.com> wrote:
> >
> > > -----Original Message-----
> > > From: Richard Sandiford <richard.sandif...@arm.com>
> > > Sent: Saturday, October 7, 2023 10:58 AM
> > > To: Richard Biener <richard.guent...@gmail.com>
> > > Cc: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> > > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> > >
> > > Richard Biener <richard.guent...@gmail.com> writes:
> > > > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina <tamar.christ...@arm.com> wrote:
> > > >>
> > > >> > -----Original Message-----
> > > >> > From: Richard Sandiford <richard.sandif...@arm.com>
> > > >> > Sent: Thursday, October 5, 2023 9:26 PM
> > > >> > To: Tamar Christina <tamar.christ...@arm.com>
> > > >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> > > >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> > > >> >
> > > >> > Tamar Christina <tamar.christ...@arm.com> writes:
> > > >> > >> -----Original Message-----
> > > >> > >> From: Richard Sandiford <richard.sandif...@arm.com>
> > > >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> > > >> > >> To: Tamar Christina <tamar.christ...@arm.com>
> > > >> > >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> > > >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> > > >> > >>
> > > >> > >> Tamar Christina <tamar.christ...@arm.com> writes:
> > > >> > >> > Hi All,
> > > >> > >> >
> > > >> > >> > This adds an implementation for masked copysign along with
> > > >> > >> > an optimized pattern for masked copysign (x, -1).
> > > >> > >>
> > > >> > >> It feels like we're ending up with a lot of AArch64-specific
> > > >> > >> code that just hard-codes the observation that changing the
> > > >> > >> sign is equivalent to changing the top bit.  We then need to
> > > >> > >> make sure that we choose the best way of changing the top
> > > >> > >> bit for any given situation.
> > > >> > >>
> > > >> > >> Hard-coding the -1/negative case is one instance of that.
> > > >> > >> But it looks like we also fail to use the best sequence for SVE2.
> > > >> > >> E.g.
> > > >> > >> [https://godbolt.org/z/ajh3MM5jv]:
> > > >> > >>
> > > >> > >>   #include <stdint.h>
> > > >> > >>
> > > >> > >>   void f(double *restrict a, double *restrict b) {
> > > >> > >>     for (int i = 0; i < 100; ++i)
> > > >> > >>       a[i] = __builtin_copysign(a[i], b[i]);
> > > >> > >>   }
> > > >> > >>
> > > >> > >>   void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> > > >> > >>     for (int i = 0; i < 100; ++i)
> > > >> > >>       a[i] = (a[i] & ~c) | (b[i] & c);
> > > >> > >>   }
> > > >> > >>
> > > >> > >> gives:
> > > >> > >>
> > > >> > >>   f:
> > > >> > >>           mov     x2, 0
> > > >> > >>           mov     w3, 100
> > > >> > >>           whilelo p7.d, wzr, w3
> > > >> > >>   .L2:
> > > >> > >>           ld1d    z30.d, p7/z, [x0, x2, lsl 3]
> > > >> > >>           ld1d    z31.d, p7/z, [x1, x2, lsl 3]
> > > >> > >>           and     z30.d, z30.d, #0x7fffffffffffffff
> > > >> > >>           and     z31.d, z31.d, #0x8000000000000000
> > > >> > >>           orr     z31.d, z31.d, z30.d
> > > >> > >>           st1d    z31.d, p7, [x0, x2, lsl 3]
> > > >> > >>           incd    x2
> > > >> > >>           whilelo p7.d, w2, w3
> > > >> > >>           b.any   .L2
> > > >> > >>           ret
> > > >> > >>   g:
> > > >> > >>           mov     x3, 0
> > > >> > >>           mov     w4, 100
> > > >> > >>           mov     z29.d, x2
> > > >> > >>           whilelo p7.d, wzr, w4
> > > >> > >>   .L6:
> > > >> > >>           ld1d    z30.d, p7/z, [x0, x3, lsl 3]
> > > >> > >>           ld1d    z31.d, p7/z, [x1, x3, lsl 3]
> > > >> > >>           bsl     z31.d, z31.d, z30.d, z29.d
> > > >> > >>           st1d    z31.d, p7, [x0, x3, lsl 3]
> > > >> > >>           incd    x3
> > > >> > >>           whilelo p7.d, w3, w4
> > > >> > >>           b.any   .L6
> > > >> > >>           ret
> > > >> > >>
> > > >> > >> I saw that you originally tried to do this in match.pd and
> > > >> > >> that the decision was to fold to copysign instead.  But
> > > >> > >> perhaps there's a compromise where isel does something with
> > > >> > >> the (new) copysign canonical form?
> > > >> > >> I.e. could we go with your new version of the match.pd
> > > >> > >> patch, and add some isel stuff as a follow-on?
> > > >> > >>
> > > >> > >
> > > >> > > Sure, if that's what's desired... But..
> > > >> > >
> > > >> > > The example you posted above is for instance worse for x86
> > > >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has
> > > >> > > a dependency chain of 2 and the latter of 3.  It's likely any
> > > >> > > open coding of this operation is going to hurt a target.
> > > >> > >
> > > >> > > So I'm unsure what isel should transform this into...
> > > >> >
> > > >> > I didn't mean that we should go straight to using isel for the
> > > >> > general case, just for the new case.  The example above was
> > > >> > instead trying to show the general point that hiding the logic
> > > >> > ops in target code is a double-edged sword.
> > > >>
> > > >> I see.. but the problem here is that transforming copysign (x, -1)
> > > >> into (x | 0x8000000000000000) would require an integer operation
> > > >> on an FP value.  I'm happy to do it, but it seems like it'll be an
> > > >> AArch64-only thing anyway.
> > > >>
> > > >> If we want to do this we need to check can_change_mode_class or a hook.
> > > >> Most targets, including x86, reject the conversion.  So it'll just
> > > >> be effectively an AArch64 thing.
> > > >>
> > > >> You're right that the actual equivalent transformation is
> > > >> https://godbolt.org/z/KesfrMv5z, but the target won't allow it.
> > > >>
> > > >> >
> > > >> > The x86_64 example for the -1 case would be
> > > >> > https://godbolt.org/z/b9s6MaKs8 where the isel change would be
> > > >> > an improvement.  Without that, I guess x86_64 will need to have
> > > >> > a similar patch to the AArch64 one.
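To make the bit-level identity concrete, here is a rough scalar sketch
(illustrative only, assuming IEEE-754 binary64 and a same-width integer
type; the helper name set_sign_bit is made up and not part of any patch):

    #include <stdint.h>
    #include <string.h>

    /* copysign (x, -1.0) == -fabs (x): force the sign bit on by doing
       integer ops on the bit pattern of the double.  */
    static double set_sign_bit (double x)
    {
      uint64_t bits;
      memcpy (&bits, &x, sizeof bits);        /* bit-cast double -> u64 */
      bits |= UINT64_C (0x8000000000000000);  /* set the top (sign) bit */
      memcpy (&x, &bits, sizeof bits);        /* bit-cast back */
      return x;
    }

On AArch64 this is a single ORR on the vector side, but on targets where
moving values between FP and integer registers isn't free, the round trip
costs more than the FP sequence it replaces.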
> > > >> >
> > > >>
> > > >> I think that's to be expected.  I think it's logical that every
> > > >> target just needs to implement its optabs optimally.
> > > >>
> > > >> > That said, https://godbolt.org/z/e6nqoqbMh suggests that
> > > >> > powerpc64 is probably relying on the current copysign -> neg/abs
> > > >> > transform.  (Not sure why the second function uses different IVs
> > > >> > from the first.)
> > > >> >
> > > >> > Personally, I wouldn't be against a target hook that indicated
> > > >> > whether float bit manipulation is "free" for a given mode, if it
> > > >> > comes to that.
> > > >>
> > > >> I'm really sure Richi would agree there.  Generally speaking I
> > > >> don't think people see doing FP operations on int as beneficial.
> > > >> But it's always the case on AArch64.
> > > >
> > > > IIRC we're doing fpclassify "expansion" early for example.
> > > >
> > > > Note the issue I see is that the middle-end shouldn't get in the
> > > > way of targets that have a copysign optab.  In case it's worthwhile
> > > > to special-case a "setsign" thing then maybe provide an optab for
> > > > that as well.  Now, that doesn't help if we want setsign to be
> > > > expanded from the middle-end but still want the copysign optab (and
> > > > not require targets to implement both if they want to escape
> > > > middle-end generic expansion of setsign).
> > > >
> > > > But yes, I also thought the , 1 and , -1 cases could be
> > > > special-cased by RTL expansion (or ISEL).  But it would need to
> > > > invoke costing, which likely means target-specific changes anyway... :/
> > >
> > > FWIW, if we had the target hook I suggested, I don't think AArch64
> > > would need to implement copysign or xorsign optabs.  That's probably
> > > true of AArch32 too (at least for all combinations that are likely to
> > > matter these days).
> > >
> > > I'd go one step further and say that, if a target wants to do its own
> > > thing for copysign and xorsign, it clearly doesn't meet the
> > > requirement that bit manipulation of floats is "free" for that mode.
> > >
> >
> > So I'm aware I have no say here, but I'll make one last effort.
> >
> > The patch isn't implementing the fneg (fabs ()) optimization just
> > because.  The location where it's implemented makes a big difference.
> >
> > If this optimization is done late, it doesn't fix the regression
> > fully, because doing it after all optimization passes have run means
> > it can't properly be optimized.
> >
> > Doing the rewrite early to ORR accomplishes two things:
> >
> > 1. It makes PRE realize that the block it's splitting would only have
> >    1 instruction in it and that such a split is not beneficial.  This
> >    is why I'm against doing such optimizations so late.  Optimizations
> >    don't happen in isolation, and when they make sense they should
> >    happen early.  Transforming fneg (fabs (..)) in isel not only feels
> >    wrong but results in a 4% performance loss.
> >
> > 2. Doing it early also gets the ORRs to be reassociated, changing
> >    where the loop-dependent variable lands.  Early makes it land in
> >    the merging MOVPRFX rather than requiring a SEL at the end of the
> >    iteration.
> >
> > Replacing the fneg (fabs (..)) with copysign accomplishes 1 but not 2.
> > This results in a 2-3% loss, but I can live with that given that doing
> > 1 gets us back to GCC 12 levels.
> >
> > Doing fneg (fabs (..)) in isel would have no meaning for me and would
> > not fix the regression, so I won't be looking to do that in that case.
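> >
> > To make the shape concrete, the kind of loop affected is roughly the
> > following (an illustrative stand-in, not the actual benchmark):
> >
> >     /* -fabs (x) under a condition is a masked copysign (x, -1):
> >        rewritten early it becomes a single predicated ORR of the sign
> >        bit inside the loop.  */
> >     void h (double *restrict a, double *restrict b, int n)
> >     {
> >       for (int i = 0; i < n; ++i)
> >         a[i] = b[i] > 0.0 ? -__builtin_fabs (a[i]) : a[i];
> >     }
> >
> > Done early, PRE sees the single ORR and doesn't split the block, and
> > reassociation can move the loop-dependent operand into the merging
> > MOVPRFX instead of needing a SEL at the end of the iteration.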
> >
> > If it's acceptable, I can transform COPYSIGN (X, -1) in gimple-isel.
> > So before I start on this, would this be acceptable to you?
>
> Since copysign (x, -1) is a single statement you can just massage the
> internal function expander?
> Or the generic expand_copysign, which already has a bit-operation
> fallback.
> The question is what 'copysign' to use during folding of fneg (fabs (...))
> when you remove the backend expander (because then IFN_COPYSIGN isn't
> directly expandable ...)
>
Indeed, I forgot to say so in my previous email.  I don't think removing
the pattern makes sense.  It doesn't work for IFN_COPYSIGN and as such
wouldn't work for vectors, so intrinsics wouldn't work.  I guess we could
somehow make IFN_COPYSIGN synthetic like COPYSIGN, but why.. there's
apparently even an RTL code for copysign that POWER has just implemented.

Regards,
Tamar

> Richard.
>
> >
> > Thanks,
> > Tamar
> >
> > > > So I have no good advice here, but I hoped that even the generic
> > > > target-specific copysign implementation with and & xor would
> > > > eventually be optimized later on RTL for a constant second arg?
> > >
> > > Yeah.  It looks like the required logic is there for scalars, it
> > > just needs extending to vectors.
> > >
> > > The patch below (untested beyond simple cases) seems to be enough to
> > > fix it, but using the simplify routines even for CONST_INTs might be
> > > controversial.
> > >
> > > Thanks,
> > > Richard
> > >
> > >
> > > diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> > > index bd9443dbcc2..5a9b1745673 100644
> > > --- a/gcc/simplify-rtx.cc
> > > +++ b/gcc/simplify-rtx.cc
> > > @@ -3409,20 +3409,20 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
> > >
> > >        /* Canonicalize (X & C1) | C2.  */
> > >        if (GET_CODE (op0) == AND
> > > -          && CONST_INT_P (trueop1)
> > > -          && CONST_INT_P (XEXP (op0, 1)))
> > > +          && CONSTANT_P (trueop1)
> > > +          && CONSTANT_P (XEXP (op0, 1)))
> > >          {
> > > -          HOST_WIDE_INT mask = GET_MODE_MASK (mode);
> > > -          HOST_WIDE_INT c1 = INTVAL (XEXP (op0, 1));
> > > -          HOST_WIDE_INT c2 = INTVAL (trueop1);
> > > +          rtx c1 = XEXP (op0, 1);
> > > +          rtx c2 = trueop1;
> > >
> > >            /* If (C1&C2) == C1, then (X&C1)|C2 becomes C2.  */
> > > -          if ((c1 & c2) == c1
> > > +          if (rtx_equal_p (simplify_binary_operation (AND, mode, c1, c2),
> > > +                           c1)
> > >                && !side_effects_p (XEXP (op0, 0)))
> > >              return trueop1;
> > >
> > >            /* If (C1|C2) == ~0 then (X&C1)|C2 becomes X|C2.  */
> > > -          if (((c1|c2) & mask) == mask)
> > > +          if (rtx_equal_p (simplify_binary_operation (IOR, mode, c1, c2),
> > > +                           CONSTM1_RTX (mode)))
> > >              return simplify_gen_binary (IOR, mode, XEXP (op0, 0), op1);
> > >          }
> > >
> > > @@ -3732,7 +3732,7 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
> > >           machines, and also has shorter instruction path length.  */
> > >        if (GET_CODE (op0) == AND
> > >            && GET_CODE (XEXP (op0, 0)) == XOR
> > > -          && CONST_INT_P (XEXP (op0, 1))
> > > +          && CONSTANT_P (XEXP (op0, 1))
> > >            && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
> > >          {
> > >            rtx a = trueop1;
> > > @@ -3746,7 +3746,7 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
> > >        /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C))  */
> > >        else if (GET_CODE (op0) == AND
> > >                 && GET_CODE (XEXP (op0, 0)) == XOR
> > > -               && CONST_INT_P (XEXP (op0, 1))
> > > +               && CONSTANT_P (XEXP (op0, 1))
> > >                 && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
> > >          {
> > >            rtx a = XEXP (XEXP (op0, 0), 0);
> > > --
> > > 2.25.1
> >
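As a side note on the patch above, if I read the rules right, the one that
matters for the generic expansion of copysign (x, -1) is the second: the
bit-operation fallback produces (x & C1) | C2 with
C1 = 0x7fffffffffffffff and C2 = 0x8000000000000000, and since

    C1 | C2 == ~0,  so  (X & C1) | C2  -->  X | C2

the whole thing simplifies to a single OR of the sign bit.  Widening the
checks from CONST_INT_P to CONSTANT_P would let the same rule fire when C1
and C2 are vector duplicates of those constants.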