> Segher Boessenkool wrote:
> On Thu, Sep 03, 2015 at 12:43:34PM +0100, Wilco Dijkstra wrote:
> > > > Combine canonicalizes certain AND masks in a comparison with zero into
> > > > extracts of the widest register type. During matching these are
> > > > expanded into a very inefficient sequence that fails to match. For
> > > > example (x & 2) == 0 is matched in combine like this:
> > > >
> > > > Failed to match this instruction:
> > > > (set (reg:CC 66 cc)
> > > > (compare:CC (zero_extract:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
> > > > (const_int 1 [0x1])
> > > > (const_int 1 [0x1]))
> > > > (const_int 0 [0])))
> > >
> > > Yes. Some processors even have specific instructions to do this.
> >
> > However there are 2 issues with this: one is the spurious subreg,
>
> Combine didn't make that up out of thin air; something already used
> DImode here. It could simplify it to SImode in this case, that is
> true, don't know why it doesn't; it isn't necessarily faster code to
> do so, it can be slower, it might not match, etc.
The relevant RTL instructions on AArch64 are:
(insn 8 3 25 2 (set (reg:SI 77 [ D.2705 ])
        (and:SI (reg/v:SI 76 [ xD.2641 ])
            (const_int 2 [0x2]))) tmp5.c:122 452 {andsi3}
     (nil))
(insn 26 25 27 2 (set (reg:CC 66 cc)
        (compare:CC (reg:SI 77 [ D.2705 ])
            (const_int 0 [0]))) tmp5.c:122 377 {*cmpsi}
     (expr_list:REG_DEAD (reg:SI 77 [ D.2705 ])
        (nil)))
I don't see anything using DI...
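For reference, this is the kind of source that produces those two insns (a
minimal reconstruction from the dump - the original test case may differ
slightly):

#include <stdint.h>

/* Insn 8 is the andsi3 and insn 26 the *cmpsi from the dump above;
   combine merges them and then rewrites the result as a zero_extract.  */
int
f (int32_t x)
{
  return (x & 2) == 0;
}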
> > the other is that only a subset of legal zero_extracts are tried (only
> > single bit and the degenerate case of zero_extract with shift of 0 -
> > which I think should not be a zero_extract). All other AND immediates
> > remain as AND.
>
> Yes. I'm happy to see this weird special-case "optimisation",
> "canonicalisation" gone.
>
> > So to emit an AND on targets without such specific instructions, or
> > where such instructions are more expensive than a simple AND (*), you
> > now need at least 3 different backend patterns for any instruction that
> > can emit AND immediate...
>
> It's only a problem for AND-and-compare, no?
Yes, so it looks like some other backends match the odd pattern and then have
another pattern change it back into the canonical AND/TST form during the
split phase (maybe the subreg confuses register allocation or blocks other
optimizations). This all seems like a lot of unnecessary complexity for a few
special immediates when there is a much simpler solution...
> > (*) I think that is another issue in combine - when both alternatives
> > match you want to select the lowest-cost one, not the first one that
> > matches.
>
> That's recog, not combine. And quite a few backends rely on "first match
> wins", because it always has been that way. It also is very easy to write
> such patterns accidentally (sometimes even the exact same one twice in the
> same machine description, etc.)
>
> > > > Failed to match this instruction:
> > > > (set (reg:CC 66 cc)
> > > > (compare:CC (and:DI (lshiftrt:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
> > > > (const_int 1 [0x1]))
> > > > (const_int 1 [0x1]))
> > > > (const_int 0 [0])))
> > >
> > > This is after r223067. Combine tests only one "final" instruction; that
> > > revision rewrites zero_ext* if it doesn't match and tries again. This
> > > helps for processors that can do more generic masks (like rs6000, and I
> > > believe also aarch64?): without it, you need many more patterns to match
> > > all the zero_ext{ract,end} cases.
> >
> > But there are more efficient ways to emit single-bit and mask tests
> > that apply to most CPUs, rather than doing something specific that
> > works for just one target. For example a single-bit test is a simple
> > shift into the carry flag or into the sign bit, and for mask tests you
> > shift out all the non-mask bits.
>
> Most of those are quite target-specific. Some others are already done,
> and/or done by other passes.
But what combine does here is even more target-specific. Shifts and AND
setting the flags are universally supported on targets with a condition code
register. Bitfield test/extract instructions are rarer, and where they are
supported they may well be more expensive.
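To illustrate what I mean in plain C (a sketch assuming a 32-bit unsigned
int; the actual transforms would of course be done on RTL):

/* (x & 2) != 0: shift bit 1 into the sign-bit position, then back down
   to bit 0, so the test needs only shifts plus the existing compare.  */
unsigned int
bit_test (unsigned int x)
{
  return (x << 30) >> 31;
}

/* (x & 6) != 0 for the contiguous mask 0x6: shift the mask bits to the
   top, then shift them back down, discarding all non-mask bits.  */
unsigned int
mask_test (unsigned int x)
{
  return ((x << 29) >> 30) != 0;
}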
> > So my question is, is it combine's job to try all possible permutations that
> > constitute a bit or mask test?
>
> Combine converts the merged instructions to what it thinks is the
> canonical or cheapest form, and uses that. It does not try multiple
> options (the zero_ext* -> and+shift rewriting is not changing the
> semantics of the pattern at all).
But the change from AND to zero_extract is already changing semantics...
> > Or would it be better to let each target decide
> > on how to canonicalize bit tests and only try that alternative?
>
> The question is how to write the pattern to be most convenient for all
> targets.
The obvious choice is to try the 2 original instructions merged.
> > > > Neither matches the AArch64 patterns for ANDS/TST (which is just
> > > > compare and AND). If the immediate is not a power of 2 or a power
> > > > of 2 minus 1 then it matches correctly as expected.
> > > >
> > > > I don't understand how ((x >> 1) & 1) != 0 could be a useful expansion
> > >
> > > It is zero_extract(x,1,1) really. This is convenient for (old and
> > > embedded) processors that have special bit-test instructions. If we
> > > now want combine to not do this, we'll have to update all backends
> > > that rely on it.
> >
> > Would any backend actually rely on this given it only does some
> > specific masks, has a redundant shift with 0 for the mask case, and the
> > odd subreg as well?
>
> Such backends match the zero_extract patterns, of course. Random example:
> the h8300 patterns for the "btst" instruction.
>
> > > They are common, and many processors had instructions for them, which is
> > > (supposedly) why combine special-cased them.
> >
> > Yes, but that doesn't mean (x & C) != 0 shouldn't be tried as well...
>
> Combine does not try multiple options.
I'm not following - combine tries zero_extract and shift+AND - that's 2 options.
If that is feasible then adding a 3rd option should be possible.
> > > > It's trivial to change change_zero_ext to expand extracts always
> > > > into AND and remove the redundant subreg.
> > >
> > > Not really trivial... Think about comparisons...
> > >
> > > int32_t x;
> > > ((x >> 31) & 1) > 0;
> > > // is not the same as
> > > (x & 0x80000000) > 0; // signed comparison
> > >
> > > (You do not easily know what the COMPARE is used for).
> >
> > Indeed, if you had a zero_extract that included the sign bit then you
> > may have to adjust the compare condition. However it looks like that
> > case can't happen - (x & 0x80000000) comparisons will have the AND
> > optimized away much earlier.
>
> A long time before combine, yes. But it is *possible* to give such insns
> to combine, and it should not do the wrong thing with them.
Yes. So that suggests we'd need to block the canonicalization rather than
undo it.
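To make the sign-bit case concrete, a standalone example (hypothetical; it
assumes GCC's arithmetic right shift for negative values, and the cast keeps
the second compare signed, matching the RTL-level discussion):

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  int32_t x = INT32_MIN;                    /* only the sign bit set */
  int a = ((x >> 31) & 1) > 0;              /* bit 31 extracted as 0/1: true */
  int b = (x & (int32_t) 0x80000000) > 0;   /* signed: INT32_MIN > 0 is false */
  printf ("%d %d\n", a, b);                 /* prints "1 0" */
  return 0;
}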
> > > > However wouldn't it make more sense to never special-case certain
> > > > AND immediates in the first place?
> > >
> > > Yes, but we need to make sure no targets regress (i.e. by rewriting
> > > patterns for those that do). We need to know the impact of such a
> > > change, at the least.
> >
> > The alternative would be to let the target decide how to canonicalize
> > bit tests. That seems like a better solution than trying multiple
> > possibilities which can never match on most targets.
>
> You will end up with a *lot* of target hooks like this. It will also
> make testing harder (less coverage). I am not sure that is a good idea.
We certainly need a lot more target hooks in general so GCC can do the right
thing (rather than using costs inconsistently all over the place). But that's
a different discussion...
Wilco