Sorry (again) for the disruption.  Hopefully, Jakub agrees, but I’d like to
think that this isn’t a problem with my recent patch, but a slightly fragile
test case.  PR target/100056 concerned an issue where the compiler would
generate three instructions for a given idiom when this could optimally be
done in two, and this was resolved by Jakub’s backend splitters.

With my recent change, the code generated for (i<<11)|i is now the same as
has always been generated for (i<<11)+i and 2049*i, which is still done
optimally in two instructions.  Alas, the trouble is that one of those
instructions is a [us]bfiz.
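
For reference, the kind of function at issue is something like the
following (a reduced sketch, not necessarily the exact body of
pr100056.c):

int
or_shift (unsigned char i)
{
  return (i << 11) | i;   /* now folded like (i << 11) + i and i * 2049 */
}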

Both i+(i<<11) and i|(i<<11) compute exactly the same result, for unsigned
char i, so the question is whether one implementation is superior to the
other, in which case GCC should generate that form for both expressions.
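
The equivalence is easy to confirm exhaustively (a quick sanity check,
assuming int is at least 32 bits so neither form overflows):

#include <assert.h>

int
main (void)
{
  unsigned int i;
  for (i = 0; i < 256; i++)
    /* Bits 0-7 and bits 11-18 never overlap, so there are no carries
       and the ior, plus and multiply forms all agree.  */
    assert (((i << 11) | i) == ((i << 11) + i)
            && ((i << 11) + i) == i * 2049);
  return 0;
}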

I’m unfamiliar with AArch64, but presumably the code generated by synth_mult
for 2049*i is optimal if the backend’s RTX costs are well parameterized.
[LLVM also generates a two-instruction sequence for 2049*i on ARM64].
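
For what it’s worth, 2049 = (1 << 11) + 1, so the natural synthesized
multiplication is a single shift-and-add step (sketch below; the function
name is mine):

unsigned int
mult_2049 (unsigned int x)
{
  return (x << 11) + x;   /* x * 2049 */
}

On AArch64 the shift can typically fold into the add’s shifted-register
operand (add w0, w0, w0, lsl 11), which is presumably why two instructions
suffice even with the zero extension of the char argument.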

I believe a solution is to tweak the test case by changing the dg-final
condition to:

scan-assembler-not \\t[us]bfiz\\tw0, w0, 11

which failed prior to Jakub’s fix for PR100056 and has passed since, but
this too is a little fragile; ideally we’d count the number of instructions.
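
In full directive form that would be something like (a sketch; the
surrounding dg-do/dg-options lines are whatever pr100056.c already uses):

/* { dg-final { scan-assembler-not "\\t\[us\]bfiz\\tw0, w0, 11" } } */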

Does this sound like a reasonable fix/workaround?  Or is the issue that
[us]bfiz is slow, and that synth_mult (and the addition form) should also
avoid generating this insn?

Thanks in advance,

Roger

--

From: Christophe Lyon <christophe.lyon....@gmail.com> 
Sent: 05 August 2021 14:53
To: Roger Sayle <ro...@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Marc Glisse <marc.gli...@inria.fr>
Subject: Re: [PATCH take 2] Fold (X<<C1)^(X<<C2) to a multiplication when possible.

On Wed, Jul 28, 2021 at 2:45 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:


Hi Marc,

Thanks for the feedback.  After some quality time in gdb, I now appreciate
that match.pd behaves (subtly) differently between generic and gimple, and
the trees actually being passed to tree_nonzero_bits were not quite what I
had expected.  Sorry for my confusion, the revised patch below is now much
shorter (and my follow-up patch that was originally to tree_nonzero_bits
looks like it now needs to be to get_nonzero_bits!).

This revised patch has been retested on x86_64-pc-linux-gnu with a
"make bootstrap" and "make -k check" with no new failures.

Ok for mainline?

2021-07-28  Roger Sayle  <ro...@nextmovesoftware.com>
            Marc Glisse  <marc.gli...@inria.fr>

gcc/ChangeLog
        * match.pd (bit_ior, bit_xor): Canonicalize (X*C1)|(X*C2) and
        (X*C1)^(X*C2) as X*(C1+C2), and related variants, using
        tree_nonzero_bits to ensure that operands are bit-wise disjoint.

gcc/testsuite/ChangeLog
        * gcc.dg/fold-ior-4.c: New test.

Hi,

This patch introduces a regression on aarch64:

FAIL: gcc.target/aarch64/pr100056.c scan-assembler-not \\t[us]bfiz\\tw[0-9]+, w[0-9]+, 11

Can you check?

Thanks,

Christophe

Roger
--

-----Original Message-----
From: Marc Glisse <marc.gli...@inria.fr>
Sent: 26 July 2021 16:45
To: Roger Sayle <ro...@nextmovesoftware.com>
Cc: 'GCC Patches' <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] Fold (X<<C1)^(X<<C2) to a multiplication when possible.

On Mon, 26 Jul 2021, Roger Sayle wrote:

> The one aspect that's a little odd is that each transform is paired
> with a convert@1 variant, using the efficient match machinery to
> expose any zero extension to fold-const.c's tree_nonzero_bits
> functionality.

Copying the first transform for context

+(for op (bit_ior bit_xor)
+ (simplify
+  (op (mult:s@0 @1 INTEGER_CST@2)
+      (mult:s@3 @1 INTEGER_CST@4))
+  (if (INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_WRAPS (type)
+       && (tree_nonzero_bits (@0) & tree_nonzero_bits (@3)) == 0)
+   (mult @1
+        { wide_int_to_tree (type, wi::to_wide (@2) + wi::to_wide (@4)); })))
+ (simplify
+  (op (mult:s@0 (convert@1 @2) INTEGER_CST@3)
+      (mult:s@4 (convert@1 @2) INTEGER_CST@5))
+  (if (INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_WRAPS (type)
+       && (tree_nonzero_bits (@0) & tree_nonzero_bits (@4)) == 0)
+   (mult @1
+        { wide_int_to_tree (type, wi::to_wide (@3) + wi::to_wide (@5)); })))

Could you explain how the convert helps exactly?

--
Marc Glisse
