Richard Sandiford <rdsandif...@googlemail.com> writes:
> "Maciej W. Rozycki" <ma...@codesourcery.com> writes:
>> On Mon, 24 Sep 2012, Richard Sandiford wrote:
>>
>>> > From the context I am assuming none of this matters for the 74K (and
>>> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall
>>> > isn't it something that should be decided based on instruction costs from
>>> > DFA schedulers? Is there anything that I've missed here? It doesn't
>>> > appear to me your (and neither the original) proposal takes instruction
>>> > cost calculation into consideration.
>>>
>>> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
>>> operations. We deliberately don't use HI and LO as scratch space.
>>>
>>> I think it's a reasonable default assumption that anything that supports
>>> those instructions also has a fast path from MULT to MADD or MULT to MSUB.
>>
>> According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4
>> cycles on that processor). An MTHI/MTLO pair will take 2 cycles;
>> obviously the resulting larger code may adversely affect cache performance
>> in some scenarios.
>
> That's not how the 4650 DFA models it though.
>
>   (define_insn_reservation "generic_hilo" 1
>     (eq_attr "type" "mfhi,mflo,mthi,mtlo")
>     "imuldiv*3")
>
>   (define_insn_reservation "r4650_imul" 4
>     (and (eq_attr "cpu" "r4650")
>          (eq_attr "type" "imul,imul3,imadd"))
>     "imuldiv*4")
>
> So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> rather than 4 cycles. Any attempt to use the DFA would still favour MULT.

Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
MULT $0,$0 would be a really bad choice. Sigh. The amount of effort
required for this optimisation is getting a bit ridiculous.

Richard