Richard Sandiford <rdsandif...@googlemail.com> writes:
> "Maciej W. Rozycki" <ma...@codesourcery.com> writes:
>> On Mon, 24 Sep 2012, Richard Sandiford wrote:
>>
>>> >  From the context I am assuming none of this matters for the 74K (and 
>>> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
>>> > isn't it something that should be decided based on instruction costs from 
>>> > DFA schedulers?  Is there anything that I've missed here?  It doesn't 
>>> > appear to me your (and neither the original) proposal takes instruction 
>>> > cost calculation into consideration.
>>> 
>>> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
>>> operations.  We deliberately don't use HI and LO as scratch space.
>>> 
>>> I think it's a reasonable default assumption that anything that supports
>>> those instructions also has a fast path from MULT to MADD or MULT to MSUB.
>>
>>  According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
>> cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
>> obviously the resulting larger code may adversely affect cache performance 
>> in some scenarios.
>
> That's not how the 4650 DFA models it though.
>
> (define_insn_reservation "generic_hilo" 1
>   (eq_attr "type" "mfhi,mflo,mthi,mtlo")
>   "imuldiv*3")
>
> (define_insn_reservation "r4650_imul" 4
>   (and (eq_attr "cpu" "r4650")
>        (eq_attr "type" "imul,imul3,imadd"))
>   "imuldiv*4")
>
> So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> rather than 4 cycles.  Any attempt to use the DFA would still favour MULT.

That said, I see that the 4kp, with its 32-cycle MULTs and MADDs, is one
core where MULT $0,$0 would be a really bad choice.  Sigh.  The amount of
effort required for this optimisation is getting a bit ridiculous.
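
To make the comparison concrete, here is a rough sketch of the arithmetic,
nothing more.  It only counts how long the imuldiv unit is held, using the
two reservations quoted above plus the 32-cycle figure for the 4kp.  The
function names are made up for illustration, and it assumes that
generic_hilo is what applies to MTHI/MTLO on the 4kp as well, which I have
not checked.  It ignores result latency and code size entirely.

```python
# Cycles for which each instruction holds the imuldiv unit.
# "imuldiv*3" in a reservation means the unit is busy for 3 cycles.
HILO_MOVE = 3    # generic_hilo: one MTHI or one MTLO
R4650_MULT = 4   # r4650_imul: imul, imul3, imadd
MULT_4KP = 32    # figure quoted in this message (assumption: unit-busy time)


def clear_hilo_cost(mult_cycles, hilo_cycles=HILO_MOVE):
    """Return (MULT $0,$0 cost, MTLO + MTHI cost) in unit-busy cycles."""
    return mult_cycles, 2 * hilo_cycles


def prefer_mult(mult_cycles, hilo_cycles=HILO_MOVE):
    """True if MULT $0,$0 holds the unit for less time than MTLO + MTHI."""
    mult, pair = clear_hilo_cost(mult_cycles, hilo_cycles)
    return mult < pair


if __name__ == "__main__":
    for name, cycles in (("r4650", R4650_MULT), ("4kp", MULT_4KP)):
        choice = "MULT $0,$0" if prefer_mult(cycles) else "MTLO + MTHI"
        print(name, clear_hilo_cost(cycles), choice)
```

With these numbers the r4650 comes out as 4 vs 6 in favour of MULT, and
the 4kp as 32 vs 6 in favour of the MTLO/MTHI pair, which is the problem.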

Richard
