"Maciej W. Rozycki" <ma...@codesourcery.com> writes:
> On Tue, 18 Sep 2012, Richard Sandiford wrote:
>
>> > Have you had time to think about this some more? I am not sure I can
>> > guess how you'd like me to fix this patch now without some more specific
>> > review and/or suggestions about where the optimization should happen and
>> > what cases it should be extended to detect in addition to the dsp
>> > accumulator multiplies.
>>
>> The patch below is the one I've been testing. But I got sidetracked
>> by looking into the possibility of removing the MD0_REG and MD1_REG
>> classes, in order to get more sensible costs. I think that was needed
>> for the madd-9.c test to pass.
>
> Sorry to come up with this so late -- I have only now noticed this being
> discussed.
>
>> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
>>    return simplify_gen_subreg (word_mode, op, mode, byte);
>>  }
>>
>> -/* Return true if a 64-bit move from SRC to DEST should be split into two. */
>> +/* Return true if SRC can be moved into DEST using MULT $0, $0. */
>> +
>> +static bool
>> +mips_mult_move_p (rtx dest, rtx src)
>> +{
>> +  return (src == const0_rtx
>> +          && REG_P (dest)
>> +          && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
>> +          && (ISA_HAS_DSP_MULT
>> +              ? ACC_REG_P (REGNO (dest))
>> +              : MD_REG_P (REGNO (dest))));
>> +}
>> +
>> +/* Return true if a move from SRC to DEST should be split into two. */
>
> Does the DSP ASE guarantee that a MULT $0, $0 is going not to be slower
> than MTHI $0/MTLO $0? The latency of multiplication varies among
> implementations, for example the original R3000 took 12 cycles (of course
> the R3000 itself is not relevant for this change, but you see the
> picture!). On the other hand in some (but not all!) processors
> multiplication runs in parallel to the main pipeline so it is the
> difference, if positive, between the number of cycles consumed by other
> instructions up to the next HI/LO access instruction and the latency of
> MULT run in the background that matters.
>
> From the context I am assuming none of this matters for the 74K (and
> presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall
> isn't it something that should be decided based on instruction costs from
> DFA schedulers? Is there anything that I've missed here? It doesn't
> appear to me your (and neither the original) proposal takes instruction
> cost calculation into consideration.

In practice, we only move 0 into HI and LO for MADD- and MSUB-style
operations. We deliberately don't use HI and LO as scratch space.
I think it's a reasonable default assumption that anything that supports
those instructions also has a fast path from MULT to MADD or MULT to
MSUB. I certainly don't know of any counter-examples.

The decision is deliberately centralised in one place so that the
condition can be tweaked in future if necessary.

Richard