On 15 November 2011 09:19, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> Revital Eres <revital.e...@linaro.org> writes:
>>> chain, so what makes the SMS version of it worse than the non-SMS version?
>>
>> I attached the SMS dump file.  The problematic loop is the one with
>> "SMS succeeded 36 2" (there are three loops in total in this file).
>> Due to these accumulators min ii is 36 which seems to cause SMS to
>> take wrong decisions.
>>
>> SMS iis 36 36 72 (rec_mii, mii, maxii)
>
> OK, so the minimum ii comes from each dependency in the chain of
> 4 accumulations having a latency of 9 cycles.  But the A9 TRM says:
>
>   If a multiply-accumulate follows a multiply or another
>   multiply-accumulate, and depends on the result of that first
>   instruction, then if the dependency between both instructions are of the
>   same type and size, the processor uses a special multiplier accumulator
>   forwarding. This special forwarding means the multiply instructions can
>   issue back-to-back because the result of the first instruction in cycle
>   5 is forwarded to the accumulator of the second instruction in cycle
>   4. If the size and type of the instructions do not match, then Dd or Qd
>   is required in cycle 3. This applies to combinations of the
>   multiply-accumulate instructions VMLA, VMLS, VQDMLA, and VQDMLS, and the
>   multiply instructions VMUL and VQDMUL.
>
> So I think the problem is that successive VMLAs don't in fact have a
> latency of 9.  However, this doesn't seem to be modelled in the ARM
> backend, either through bypasses or in a sched-reorder hook.
> In contrast, the A8 pipeline description has:
This should be identical for both the A8 and A9 descriptions.

;; Instructions using this reservation read their (D|Q)n operands at N2,
;; their (D|Q)m operands at N1, their (D|Q)d operands at N3, and
;; produce a result at N6 on cycle 4.
(define_insn_reservation "cortex_a8_neon_mla_qqq_32_qqd_32_scalar" 9
  (and (eq_attr "tune" "cortexa8")
       (eq_attr "neon_type" "neon_mla_qqq_32_qqd_32_scalar"))
  "cortex_a8_neon_dp_4")

I thought I had spotted the bypass for this, but you are right: there is
no bypass that handles this particular case.

>
>  ;; A multiply with a single-register result or an MLA, followed by an
>  ;; MLA with an accumulator dependency, has its result forwarded so two
>  ;; such instructions can issue back-to-back.
>  (define_bypass 1 "cortex_a8_mul,cortex_a8_mla,cortex_a8_smulwy"
>                 "cortex_a8_mla"
>                 "arm_mac_accumulator_is_mul_result")
>

But that models only the scalar bypasses for the A8, indicating
back-to-back issue of a multiply followed by an MLA.  The A9 descriptions
should handle this with appropriate issue restrictions.

> I'm not sure from the A9 description whether "following" means
> "immediately following", or whether gaps between instructions are
> allowed (and, in the latter case, whether the gap can be filled with
> arbitrary instructions, or whether restrictions apply, such as
> "anything but another NEON multiplication").  Ramana, do you know?

I don't know the answer to that specific question and will have to try
a few experiments.

>
> Anyway, I think this explains why the non-SMS loop executes more
> quickly than GCC expects, and why the SMS loop is slower than it
> needs to be.  It might be worth comparing the two loops with
> -mtune=cortex-a8.
>
> Richard
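
As a rough illustration of where the rec_mii of 36 in the quoted dump comes from, here is a minimal Python sketch. It assumes the recurrence is the chain of 4 accumulations with a loop-carried distance of 1. The helper name `rec_mii` and the forwarded latency of 1 cycle are placeholders of mine (the 1 is simply the value used in the A8 scalar bypass), not figures taken from the GCC sources or the A9 TRM.

```python
from math import ceil


def rec_mii(latencies, distance=1):
    """Recurrence-constrained minimum II for one dependence cycle.

    latencies: latency in cycles of each dependence edge on the cycle.
    distance:  total iteration distance around the cycle (1 for an
               accumulator carried from one iteration to the next).
    """
    return ceil(sum(latencies) / distance)


# Four chained accumulations, each modelled with a latency of 9 cycles,
# as in the reservation quoted in this thread.
modelled = rec_mii([9] * 4)

# Placeholder only: 1 cycle is the A8 scalar bypass value, NOT a measured
# or documented A9 NEON forwarding latency.
forwarded = rec_mii([1] * 4)

print(modelled, forwarded)
```

With the modelled latency this gives 36, matching the dump. How far the bound would really drop depends on the actual forwarded latency, which still needs to be established by experiment.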