Sorry (again) for the disruption. Hopefully, Jakub agrees, but I’d like to think that this
isn’t a problem with my recent patch, but a slightly fragile test case. PR target/100056 concerned an issue where the compiler would generate three instructions for a given idiom when this could optimally be done in two, and this was resolved by Jakub’s backend splitters. With my recent change, the code generated for (i<<11)|i is now the same as has always been generated for (i<<11)+i and 2049*i, which is still done optimally in two instructions. Alas, the trouble is that one of those instructions is a [us]bfiz. Both i+(i<<11) and i|(i<<11) compute exactly the same result for unsigned char i, so the question is whether one implementation is superior to the other, in which case GCC should generate that form for both expressions. I’m unfamiliar with AArch64, but presumably the code generated by synth_mult for 2049*i is optimal if the backend’s RTX_COST is well parameterized. [LLVM also generates a two-instruction sequence for 2049*i on ARM64.]

I believe a solution is to tweak the test case by changing the dg-final condition to:

    scan-assembler-not \\t[us]bfiz\\tw0, w0, 11

which failed prior to Jakub’s fix for PR100056 and passes since then, but this too is a little fragile; ideally we’d count the number of instructions. Does this sound like a reasonable fix/workaround? Or is the issue that [us]bfiz is slow, and that synth_mult (and the addition form) should also avoid generating this insn?

Thanks in advance,
Roger
--

From: Christophe Lyon <christophe.lyon....@gmail.com>
Sent: 05 August 2021 14:53
To: Roger Sayle <ro...@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Marc Glisse <marc.gli...@inria.fr>
Subject: Re: [PATCH take 2] Fold (X<<C1)^(X<<C2) to a multiplication when possible.

On Wed, Jul 28, 2021 at 2:45 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:

Hi Marc,
Thanks for the feedback.
After some quality time in gdb, I now appreciate that match.pd behaves (subtly) differently between GENERIC and GIMPLE, and the trees actually being passed to tree_nonzero_bits were not quite what I had expected. Sorry for my confusion; the revised patch below is now much shorter (and my follow-up patch, which was originally against tree_nonzero_bits, now looks like it needs to be against get_nonzero_bits!). This revised patch has been retested on x86_64-pc-linux-gnu with a "make bootstrap" and "make -k check" with no new failures. Ok for mainline?

2021-07-28  Roger Sayle  <ro...@nextmovesoftware.com>
	    Marc Glisse  <marc.gli...@inria.fr>

gcc/ChangeLog
	* match.pd (bit_ior, bit_xor): Canonicalize (X*C1)|(X*C2) and
	(X*C1)^(X*C2) as X*(C1+C2), and related variants, using
	tree_nonzero_bits to ensure that operands are bit-wise disjoint.

gcc/testsuite/ChangeLog
	* gcc.dg/fold-ior-4.c: New test.

Hi,

This patch introduces a regression on aarch64:
FAIL: gcc.target/aarch64/pr100056.c scan-assembler-not \\t[us]bfiz\\tw[0-9]+, w[0-9]+, 11

Can you check?

Thanks,
Christophe

Roger
--

-----Original Message-----
From: Marc Glisse <marc.gli...@inria.fr>
Sent: 26 July 2021 16:45
To: Roger Sayle <ro...@nextmovesoftware.com>
Cc: 'GCC Patches' <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] Fold (X<<C1)^(X<<C2) to a multiplication when possible.

On Mon, 26 Jul 2021, Roger Sayle wrote:

> The one aspect that's a little odd is that each transform is paired
> with a convert@1 variant, using the efficient match machinery to
> expose any zero extension to fold-const.c's tree_nonzero_bits functionality.
Copying the first transform for context:

+(for op (bit_ior bit_xor)
+ (simplify
+  (op (mult:s@0 @1 INTEGER_CST@2)
+      (mult:s@3 @1 INTEGER_CST@4))
+  (if (INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_WRAPS (type)
+       && (tree_nonzero_bits (@0) & tree_nonzero_bits (@3)) == 0)
+   (mult @1
+    { wide_int_to_tree (type, wi::to_wide (@2) + wi::to_wide (@4)); })))
+(simplify
+ (op (mult:s@0 (convert@1 @2) INTEGER_CST@3)
+     (mult:s@4 (convert@1 @2) INTEGER_CST@5))
+ (if (INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_WRAPS (type)
+      && (tree_nonzero_bits (@0) & tree_nonzero_bits (@4)) == 0)
+  (mult @1
+   { wide_int_to_tree (type, wi::to_wide (@3) + wi::to_wide (@5)); })))

Could you explain how the convert helps exactly?

--
Marc Glisse