On Fri, Jul 24, 2015 at 11:55:33AM +0100, Kyrill Tkachov wrote:
> Hi all,
>
> This patch implements an aarch64-specific expansion of the signed modulo by a
> power of 2.
> The proposed sequence makes use of the conditional negate instruction CSNEG.
> For a power of two N, x % N can be calculated with:
>   negs  x1, x0
>   and   x0, x0, #(N - 1)
>   and   x1, x1, #(N - 1)
>   csneg x0, x0, x1, mi
>
> So, for N == 256 this would be:
>   negs  x1, x0
>   and   x0, x0, #255
>   and   x1, x1, #255
>   csneg x0, x0, x1, mi
>
> For comparison, the existing sequence emitted by expand_smod_pow2 in expmed.c
> is:
>   asr x1, x0, 63
>   lsr x1, x1, 56
>   add x0, x0, x1
>   and x0, x0, 255
>   sub x0, x0, x1
>
> Note that the CSNEG sequence is one instruction shorter and that the two and
> operations are independent, whereas in the existing sequence every instruction
> depends on the preceding one.
>
> For the special case of N == 2 we can do even better:
>   cmp   x0, xzr
>   and   x0, x0, 1
>   csneg x0, x0, x0, ge
>
> I first tried implementing this in the generic code in expmed.c, but that
> didn't work out for a few reasons:
>
> * This relies on having a conditional-negate instruction. We could gate it on
> HAVE_conditional_move, and the combiner is capable of merging the final negate
> into the conditional move if a conditional negate is available (as on aarch64),
> but on targets without a conditional negate this would end up emitting a
> separate negate.
>
> * The first negs has to be a negs for the sequence to be a win, i.e. having a
> separate negate and compare makes the sequence slower than the existing one
> (at least in my microbenchmarking), and I couldn't get subsequent passes to
> combine the negate and compare into the negs (presumably due to the use of the
> negated result in one of the ands).
>
> Doing it in the aarch64 backend, where I can just call the exact gen_*
> functions that I need, worked much more cleanly.
>
> The costing logic is updated to reflect this sequence during the
> initialisation of expmed.c where it calculates the smod_pow2_cheap metric.
>
> The tests will come in patch 3 of the series, which are partly shared with
> the equivalent arm implementation.
>
> Bootstrapped and tested on aarch64.
> Ok for trunk?
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 9d88a60..7bb4a55 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -6639,8 +6639,26 @@ cost_plus:
>        if (VECTOR_MODE_P (mode))
>          *cost += extra_cost->vect.alu;
>        else if (GET_MODE_CLASS (mode) == MODE_INT)
> -        *cost += (extra_cost->mult[mode == DImode].add
> -                  + extra_cost->mult[mode == DImode].idiv);
> +        {
> +          /* We can expand signed mod by power of 2 using a
> +             NEGS, two parallel ANDs and a CSNEG.  Assume here
> +             that CSNEG is COSTS_N_INSNS (1).  This case should
> +             only ever be reached through the set_smod_pow2_cheap check
> +             in expmed.c.  */
> +          if (code == MOD
> +              && CONST_INT_P (XEXP (x, 1))
> +              && exact_log2 (INTVAL (XEXP (x, 1))) > 0
> +              && (mode == SImode || mode == DImode))
> +            {
> +              *cost += COSTS_N_INSNS (3)
> +                       + 2 * extra_cost->alu.logical
> +                       + extra_cost->alu.arith;
> +              return true;
> +            }
> +
> +          *cost += (extra_cost->mult[mode == DImode].add
> +                    + extra_cost->mult[mode == DImode].idiv);
> +        }
>        else if (mode == DFmode)
>          *cost += (extra_cost->fp[1].mult
>                    + extra_cost->fp[1].div);
This looks like it calculates the wrong cost for !speed. I think we will
still expand through mod<mode>3 when compiling for size, so we probably
still want to cost the multiple instructions. Have I misunderstood?

Thanks,
James