On 15 November 2011 09:19, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> Revital Eres <revital.e...@linaro.org> writes:
>>> chain, so what makes the SMS version of it worse than the non-SMS version?
>>
>> I attached the SMS dump file. The problematic loop is the one with
>> "SMS succeeded 36 2" (there are three loops in total in this file).
>> Due to these accumulators min ii is 36 which seems to cause SMS to
>> take wrong decisions.
>>
>> SMS iis 36 36 72 (rec_mii, mii, maxii)
>
> OK, so the minimum ii comes from each dependency in the chain of
> 4 accumulations having a latency of 9 cycles.  But the A9 TRM says:
>
>    If a multiply-accumulate follows a multiply or another
>    multiply-accumulate, and depends on the result of that first
>    instruction, then if the dependency between both instructions are of the
>    same type and size, the processor uses a special multiplier accumulator
>    forwarding. This special forwarding means the multiply instructions can
>    issue back-to-back because the result of the first instruction in cycle
>    5 is forwarded to the accumulator of the second instruction in cycle
>    4. If the size and type of the instructions do not match, then Dd or Qd
>    is required in cycle 3. This applies to combinations of the
>    multiply-accumulate instructions VMLA, VMLS, VQDMLA, and VQDMLS, and the
>    multiply instructions VMUL and VQDMUL.
>
> So I think the problem is that successive VMLAs don't in fact have a
> latency of 9.  However, this doesn't seem to be modelled in the ARM
> backend, either through bypasses or in a sched-reorder hook.
> In contrast, the A8 pipeline description has:

This should be identical for both the A8 and A9 descriptions.

;; Instructions using this reservation read their (D|Q)n operands at N2,
;; their (D|Q)m operands at N1, their (D|Q)d operands at N3, and
;; produce a result at N6 on cycle 4.
(define_insn_reservation "cortex_a8_neon_mla_qqq_32_qqd_32_scalar" 9
  (and (eq_attr "tune" "cortexa8")
       (eq_attr "neon_type" "neon_mla_qqq_32_qqd_32_scalar"))
  "cortex_a8_neon_dp_4")

I thought I spotted the bypass for this but you are right, there is no
bypass that handles this particular case.


>
> ;; A multiply with a single-register result or an MLA, followed by an
> ;; MLA with an accumulator dependency, has its result forwarded so two
> ;; such instructions can issue back-to-back.
> (define_bypass 1 "cortex_a8_mul,cortex_a8_mla,cortex_a8_smulwy"
>               "cortex_a8_mla"
>               "arm_mac_accumulator_is_mul_result")
>

But that models only the scalar bypass on the A8, i.e. the back-to-back
issue of a multiply followed by an MLA. The A9 descriptions should
handle this with appropriate issue restrictions.

> I'm not sure from the A9 description whether "following" means
> "immediately following", or whether gaps between instructions are
> allowed (and, in the latter case, whether the gap can be filled with
> arbitrary instructions, or whether restrictions apply, such as
> "anything but another NEON multiplication").  Ramana, do you know?

I don't know the answer to that specific question and will have to try
a few experiments.
>
> Anyway, I think this explains why the non-SMS loop executes more
> quickly than GCC expects, and why the SMS loop is slower than it
> needs to be.  It might be worth comparing the two loops with
> -mtune=cortex-a8.
>
> Richard
>
