On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin <izamya...@gmail.com> wrote:
> On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev <a...@ispras.ru> wrote:
>> On 13.04.2012 14:18, Igor Zamyatin wrote:
>>>
>>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev<a...@ispras.ru>
>>>  wrote:
>>>>
>>>> On 12.04.2012 16:38, Richard Guenther wrote:
>>>>>
>>>>>
>>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin<izamya...@gmail.com>
>>>>>  wrote:
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
>>>>>> <richard.guent...@gmail.com>    wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov<amona...@ispras.ru>
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Can atom execute two IMUL in parallel?  Or what exactly is the
>>>>>>>>> pipeline
>>>>>>>>> behavior?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> As I understand from Intel's optimization reference manual, the
>>>>>>>> behavior is as follows: if the instruction immediately following
>>>>>>>> IMUL has a shorter latency, execution is stalled for 4 cycles
>>>>>>>> (which is IMUL's latency); otherwise, an instruction with a latency
>>>>>>>> of 4 or more cycles can be issued after IMUL without a stall.
>>>>>>>> In other words, IMUL is pipelined with respect to other long-latency
>>>>>>>> instructions, but not to short-latency instructions.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> It seems to be modeled in the pipeline description though:
>>>>>>>
>>>>>>> ;;; imul insn has 5 cycles latency
>>>>>>> (define_reservation "atom-imul-32"
>>>>>>>                     "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
>>>>>>>                      atom-port-0")
>>>>>>>
>>>>>>> ;;; imul instruction excludes other non-FP instructions.
>>>>>>> (exclusion_set "atom-eu-0, atom-eu-1"
>>>>>>>               "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")
>>>>>>>
>>>>>>
>>>>>> The main idea is quite simple:
>>>>>>
>>>>>> If we are about to schedule an IMUL instruction (it is at the top of
>>>>>> the ready list), we try to find the producer of another (independent)
>>>>>> IMUL instruction that is also in the ready list.  The goal is to
>>>>>> schedule that producer first, so that another IMUL appears in the
>>>>>> ready list and we get two successive IMUL instructions scheduled.
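
To make that concrete, here is a rough sketch of how such a sched_reorder
hook could look.  This is only an illustration of the approach, not the
attached patch; it assumes the usual i386.c context plus sched-int.h for the
dependency walkers, and insn_is_imul / x86_sched_reorder_sketch are made-up
names.  In the reorder hook the best ready candidate sits in ready[n - 1].

static bool
insn_is_imul (rtx insn)
{
  /* A real (non-debug) insn that recog recognizes and whose i386 "type"
     attribute is imul.  */
  return NONDEBUG_INSN_P (insn)
         && recog_memoized (insn) >= 0
         && get_attr_type (insn) == TYPE_IMUL;
}

static int
x86_sched_reorder_sketch (FILE *dump ATTRIBUTE_UNUSED,
                          int verbose ATTRIBUTE_UNUSED,
                          rtx *ready, int *n_readyp,
                          int clock ATTRIBUTE_UNUSED)
{
  int n = *n_readyp;
  int i;

  /* Only act when the insn about to be issued is an imul.  */
  if (n < 2 || !insn_is_imul (ready[n - 1]))
    return ix86_issue_rate ();

  for (i = n - 2; i >= 0; i--)
    {
      sd_iterator_def sd_it;
      dep_t dep;
      bool feeds_imul = false;

      if (!NONDEBUG_INSN_P (ready[i]))
        continue;

      /* Does ready[i] produce a value consumed by another imul
         (ideally one that is not ready yet)?  */
      FOR_EACH_DEP (ready[i], SD_LIST_FORW, sd_it, dep)
        if (insn_is_imul (DEP_CON (dep)))
          {
            feeds_imul = true;
            break;
          }

      if (feeds_imul)
        {
          /* Move the producer into the best-candidate slot so it is
             issued before the imul currently at the top.  */
          rtx producer = ready[i];
          memmove (ready + i, ready + i + 1, (n - i - 1) * sizeof (rtx));
          ready[n - 1] = producer;
          break;
        }
    }

  return ix86_issue_rate ();
}

(The hook's return value is the number of insns that may be issued this
cycle; returning the issue rate keeps the default behavior there.)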
>>>>>
>>>>>
>>>>>
>>>>> Why does that not happen without your patch?  Does it never happen
>>>>> without
>>>>> your patch or does it merely not happen for one EEMBC benchmark (can
>>>>> you provide a testcase?)?
>>>>
>>>>
>>>>
>>>> It does not happen because the scheduler by itself does not do such
>>>> specific reordering.  That said, it is easy to imagine cases where this
>>>> patch will make things worse rather than better.
>>>>
>>>> Igor, why not try a different, subtler mechanism like adjust_priority,
>>>> which gets called when an insn is added to the ready list?  E.g. increase
>>>> the producer's priority.
>>>>
>>>> The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
>>>> when you have more than one imul in the ready list?  Don't you want to
>>>> bump the priority of the other imul found?
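
For comparison, a minimal sketch of the adjust_priority alternative mentioned
above could look like the following; again purely illustrative, reusing the
hypothetical insn_is_imul helper and dependency walk from the sketch earlier
in the thread:

static int
x86_sched_adjust_priority_sketch (rtx insn, int priority)
{
  sd_iterator_def sd_it;
  dep_t dep;

  if (!NONDEBUG_INSN_P (insn))
    return priority;

  /* Give a small boost to insns that feed an imul; the scheduler's
     normal sorting heuristics otherwise stay in charge.  */
  FOR_EACH_DEP (insn, SD_LIST_FORW, sd_it, dep)
    if (insn_is_imul (DEP_CON (dep)))
      return priority + 1;

  return priority;
}

The trade-off discussed below is that a priority bump only nudges the
sorting, while sched_reorder forces the producer to the very front of the
list.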
>>>
>>>
>>> Could you provide some examples of cases where this patch would harm
>>> performance?
>>
>>
>> I was thinking of cases where other ready insns could fill the hole, and
>> that would be more beneficial because e.g. they would be on more critical
>> paths than the producer of your second imul.  I don't know enough about
>> Atom to give an example -- maybe some long divisions?
>>
>>
>>>
>>> Sched_reorder was chosen since it is used in other ports and looks
>>> most suitable for such a case, e.g. it provides access to the whole
>>> ready list.
>>> BTW, just increasing the producer's priority seems riskier
>>> performance-wise - we could incorrectly start delaying some
>>> instructions.
>>
>>
>> Yes, but exactly because of the above example you can start incorrectly
>> delaying other insns, too, since you force the insn to be first in the
>> list.  Bumping the priority, on the other hand, leaves the scheduler's
>> sorting heuristics in place and actually lowers that risk.
>>
>>>
>>> I thought the ready list doesn't contain DEBUG_INSNs... Is that so?  If
>>> it does contain them, this check could be added easily.
>>
>>
>> It does, but I'm not sure whether the sched_reorder hook gets them or
>> whether they are immediately removed -- I saw similar checks in one of the
>> targets' hooks.
>
> Done with DEBUG_INSN handling; the one-imul limit was also removed.  Patch attached.
>
>>
>> Anyway, my main thought was that it is better to test on more benchmarks
>> to alleviate the above concerns, so as long as the i386 maintainers are
>> happy, I don't see major problems here.  A good idea could be to
>> generalize the patch to handle other long-latency insns as second
>> consumers, not only imuls, if this is relevant for Atom.
>
> Yes, generalizing this approach is in the plans.  According to the Atom
> software optimization guide there is further headroom here.
> As for trying more benchmarks - this particular case is indeed quite
> rare.  I attached an example where the patch helps to group imuls into
> pairs, which is profitable for Atom.  Such code is not very common,
> but hopefully this approach can help avoid this and other glass jaws.
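
To give a flavor of the pattern in question, here is a toy source example
(not the attached test, just an illustration):

int
foo (int a, int c, int *p)
{
  int b = p[0] + p[1];      /* producer of the second multiply */
  return a * c + b * c;     /* two independent imuls */
}

Here the first imul (a * c) can be issued before b is computed, giving the
"imul-producer-imul" order; issuing the producer first lets the two imuls
pair up on Atom.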

BTW, this patch also helps some EEMBC tests when -funroll-loops is specified.
So, any feedback from the i386 maintainers about this? :)

The ChangeLog has changed slightly:


2012-04-10  Yuri Rumyantsev  <yuri.s.rumyant...@intel.com>

        * config/i386/i386.c (x86_sched_reorder): New function to take
        advantage of Atom's pipelined IMUL execution.
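
(For completeness: a hook like this is normally registered in i386.c with the
usual target-macro pattern, e.g.

#undef TARGET_SCHED_REORDER
#define TARGET_SCHED_REORDER x86_sched_reorder

though the exact wiring in the attached patch may differ.)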


>
>>
>> Andrey
>>
>>
>>>
>>> The case with more than one imul in the ready list wasn't considered
>>> simply because the goal was to catch the particular case where there is
>>> a risk of getting the following picture: "imul-producer-imul", which is
>>> less efficient than "producer-imul-imul" on Atom.
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>> And the MD only allows us to prefer scheduling of successive IMUL
>>>>>> instructions, i.e. if an IMUL was just scheduled and the ready list
>>>>>> contains another IMUL instruction, then it will be chosen as the best
>>>>>> candidate for scheduling.
>>>>>>
>>>>>>
>>>>>>> at least from my very limited guessing of what the above does.  So,
>>>>>>> did you analyze why the scheduler does not properly schedule things
>>>>>>> for you?
>>>>>>>
>>>>>>> Richard.
>>>>>>>
>>>>>>>>
>>>>>>>> From reading the patch, I could not understand the link between
>>>>>>>> pipeline behavior and what the patch appears to do.
>>>>>>>>
>>>>>>>> Hope that helps.
>>>>>>>>
>>>>>>>> Alexander
>>>>
>>>>
>>>>
>>>
>>> Thanks,
>>> Igor
>>
>>
