On 24/07/15 10:00, Richard Biener wrote:
On Fri, 24 Jul 2015, Kyrill Tkachov wrote:
On 24/07/15 09:23, Jakub Jelinek wrote:
On Fri, Jul 24, 2015 at 09:09:39AM +0100, Kyrill Tkachov wrote:
This transformation folds (X % C) == N into
X & ((1 << (size - 1)) | (C - 1))) == N
for constants C and N where N is positive and C is a power of 2.
The idea is similar to the existing X % C -> X & (C - 1) transformation
for unsigned values but this time when we have the comparison we use the
C - 1 mask (all 1s for power of 2 N) orred with the sign bit.
At runtime, if X is positive then X & ((1 << (size - 1)) | (C - 1)))
calculates X % C, which is compared against the positive N.
If X is negative then X & ((1 << (size - 1)) | (C - 1))) doesn't calculate
X % C but since we also catch the set sign bit, it will never compare
equal
to N (which is positive), so we still get the right comparison result.
I don't have much experience with writing match.pd patterns, so I
appreciate
any feedback on how to write this properly.
Bootstrapped and tested on arm, aarch64, x86_64.
I think this is another case that, if at all, should be done during or right
before RTL expansion and should test rtx costs.
Hmm, where would that be?
expmed.c has a lot of code to expand div or mod by a power of 2.
In what form would a compare with a mod by power of 2 arrive to
the expansion phase?
It arrives as SSA_NAME == N and you can use get_gimple_for_ssa_name
or get_def_for_expr to get at the defining stmt if that is possible
(it's still unexpanded and thus TERed) and expand a different
expression.
Thanks, so it's where we expand compares... (what's TERed?)
But why can't simplify-rtx via combine handle this - it should have
access to target costs.
That would require for the target to expand to an SMOD rtx
which, if the target has no direct instruction for would be somewhat
awkward.
Thanks,
Kyrill
Because, ((1 << (size - 1)) | (C - 1))) constant might be very expensive,
while C cheap, and % might not be that expensive compared to & to offset
that.
E.g. on x86_64, for 32-bit and smaller X the constant is cheap as any other
(well, if we don't take instruction size into account), but 64-bit constant
is at least 3 times more expensive (movabsq is needed with its latency).
In the x86_64 case supposedly the divmod is still more expensive, but there
are many other targets. On sparc64 for 64-bit constants, you might need
many instructions to create the constants, etc.
Ok, I am not familiar with sparc64. The constant is just a 1
in the sign bit orred with a continuous string of ones.
That's usually cheap on aarch64 but may not be so on other targets.
On GIMPLE we might still want to canonicalize to one form. I'd
canonicalize to the form with "smaller" constants if the number
of operations is the same.
Richard.
Thanks,
Kyrill
Jakub