Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2

Richard Sandiford Mon, 24 Sep 2012 13:48:57 -0700

"Maciej W. Rozycki" <ma...@codesourcery.com> writes:
> On Tue, 18 Sep 2012, Richard Sandiford wrote:
>
>> > Have you had time to think about this some more?  I am not sure I can 
>> > guess how you'd like me to fix this patch now without some more specific 
>> > review and/or suggestions about where the optimization should happen and 
>> > what cases it should be extended to detect in addition to the dsp 
>> > accumulator multiplies.
>> 
>> The patch below is the one I've been testing.  But I got sidetracked
>> by looking into the possibility of removing the MD0_REG and MD1_REG
>> classes, in order to get more sensible costs.  I think that was needed
>> for the madd-9.c test to pass.
>
>  Sorry to come up with this so late -- I have only now noticed this being 
> discussed.
>
>> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
>>    return simplify_gen_subreg (word_mode, op, mode, byte);
>>  }
>>  
>> -/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
>> +/* Return true if SRC can be moved into DEST using MULT $0, $0.  */
>> +
>> +static bool
>> +mips_mult_move_p (rtx dest, rtx src)
>> +{
>> +  return (src == const0_rtx
>> +      && REG_P (dest)
>> +      && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
>> +      && (ISA_HAS_DSP_MULT
>> +          ? ACC_REG_P (REGNO (dest))
>> +          : MD_REG_P (REGNO (dest))));
>> +}
>> +
>> +/* Return true if a move from SRC to DEST should be split into two.  */
>
>  Does the DSP ASE guarantee that a MULT $0, $0 is going not to be slower 
> than MTHI $0/MTLO $0?  The latency of multiplication varies among 
> implementations, for example the original R3000 took 12 cycles (of course 
> the R3000 itself is not relevant for this change, but you see the 
> picture!).  On the other hand in some (but not all!) processors 
> multiplication runs in parallel to the main pipeline so it is the 
> difference, if positive, between the number of cycles consumed by other 
> instructions up to the next HI/LO access instruction and the latency of 
> MULT run in the background that matters.
>
>  From the context I am assuming none of this matters for the 74K (and 
> presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
> isn't it something that should be decided based on instruction costs from 
> DFA schedulers?  Is there anything that I've missed here?  It doesn't 
> appear to me your (and neither the original) proposal takes instruction 
> cost calculation into consideration.
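
The latency argument quoted above can be put as simple arithmetic. The following is only a sketch: the function name and its parameters are hypothetical and do not come from GCC or from the patch. It assumes a core where the multiply runs in the background, so the only cost is whatever part of the MULT latency is not hidden by the instructions issued before the next HI/LO access.

```c
#include <assert.h>

/* Hypothetical helper, not part of GCC: return the number of cycles the
   main pipeline would stall at the next HI/LO access, assuming the
   multiply unit runs in parallel with the main pipeline.

   MULT_LATENCY is the latency of MULT in cycles.
   CYCLES_BEFORE_ACCESS is the number of cycles consumed by other
   instructions between the MULT and the next HI/LO access.  */
static int
hilo_stall_cycles (int mult_latency, int cycles_before_access)
{
  assert (mult_latency >= 0 && cycles_before_access >= 0);

  /* Only a positive difference matters; if the intervening instructions
     already cover the latency, there is no stall.  */
  int diff = mult_latency - cycles_before_access;
  return diff > 0 ? diff : 0;
}
```

With the 12-cycle figure mentioned for the original R3000, five cycles of intervening work would leave seven cycles exposed, while twenty cycles of intervening work would hide the multiply entirely. On a core where the multiply blocks the main pipeline, this model does not apply and the full latency is paid.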


In practice, we only move 0 into HI and LO for MADD- and MSUB-style
operations.  We deliberately don't use HI and LO as scratch space.

I think it's a reasonable default assumption that anything that supports
those instructions also has a fast path from MULT to MADD or MULT to MSUB.
I certainly don't know of any counter-examples.  The decision is deliberately
centralised in one place so that the condition can be tweaked in future
if necessary.
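
As a rough illustration of what keeping the decision in one place allows, here is a standalone sketch that models the shape of mips_mult_move_p outside GCC. Every type and field below is a hypothetical stand-in for the real rtx and target macros, and the fast_mult_to_madd flag is an assumed per-core tuning knob that is not in the patch. The treatment of HI/LO as one of the DSP accumulators is likewise an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for GCC internals; none of these are the real
   definitions used by the MIPS back end.  */
enum reg_kind { REG_GPR, REG_MD, REG_DSP_ACC };

struct target_info
{
  bool has_dsp_mult;       /* models ISA_HAS_DSP_MULT */
  bool fast_mult_to_madd;  /* assumed tuning knob, not in the patch */
  int units_per_word;      /* models UNITS_PER_WORD */
};

struct operand
{
  bool is_reg;             /* models REG_P */
  enum reg_kind kind;      /* register class, meaningful if is_reg */
  int size;                /* mode size in bytes */
  bool is_const_zero;      /* models a comparison with const0_rtx */
};

/* Return true if SRC can be moved into DEST using MULT $0, $0.
   This is the single place a per-core condition would be added.  */
static bool
mult_move_p (const struct target_info *t, const struct operand *dest,
             const struct operand *src)
{
  assert (t->units_per_word > 0);

  /* Assumed extension point: a core without a fast MULT-to-MADD path
     could opt out here without touching any caller.  */
  if (!t->fast_mult_to_madd)
    return false;

  return (src->is_const_zero
          && dest->is_reg
          && dest->size == 2 * t->units_per_word
          && (t->has_dsp_mult
              ? (dest->kind == REG_MD || dest->kind == REG_DSP_ACC)
              : dest->kind == REG_MD));
}
```

Callers only ever ask the predicate, so a future change to the condition, whether a tuning flag as assumed here or a scheduler-derived cost, would stay confined to this one function.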

Richard
