On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin <izamya...@gmail.com> wrote:
> On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev <a...@ispras.ru> wrote:
>> On 13.04.2012 14:18, Igor Zamyatin wrote:
>>>
>>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev <a...@ispras.ru> wrote:
>>>>
>>>> On 12.04.2012 16:38, Richard Guenther wrote:
>>>>>
>>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin <izamya...@gmail.com> wrote:
>>>>>>
>>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>
>>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov <amona...@ispras.ru> wrote:
>>>>>>>>
>>>>>>>>> Can atom execute two IMUL in parallel?  Or what exactly is the
>>>>>>>>> pipeline behavior?
>>>>>>>>
>>>>>>>> As I understand from Intel's optimization reference manual, the
>>>>>>>> behavior is as follows: if the instruction immediately following IMUL
>>>>>>>> has shorter latency, execution is stalled for 4 cycles (which is
>>>>>>>> IMUL's latency); otherwise, an instruction with latency of 4 or more
>>>>>>>> cycles can be issued after IMUL without a stall.  In other words,
>>>>>>>> IMUL is pipelined with respect to other long-latency instructions,
>>>>>>>> but not to short-latency instructions.
>>>>>>>
>>>>>>> It seems to be modeled in the pipeline description, though:
>>>>>>>
>>>>>>> ;;; imul insn has 5 cycles latency
>>>>>>> (define_reservation "atom-imul-32"
>>>>>>>                     "atom-imul-1, atom-imul-2, atom-imul-3,
>>>>>>>                      atom-imul-4, atom-port-0")
>>>>>>>
>>>>>>> ;;; imul instruction excludes other non-FP instructions.
>>>>>>> (exclusion_set "atom-eu-0, atom-eu-1"
>>>>>>>                "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")
>>>>>>>
>>>>>>
>>>>>> The main idea is quite simple:
>>>>>>
>>>>>> If we are going to schedule an IMUL instruction (it is on top of the
>>>>>> ready list), we try to find the producer of another (independent) IMUL
>>>>>> instruction that is also in the ready list.  The goal is to schedule
>>>>>> that producer first, so that the second IMUL enters the ready list and
>>>>>> we end up scheduling two successive IMUL instructions.
>>>>>
>>>>> Why does that not happen without your patch?  Does it never happen
>>>>> without your patch, or does it merely not happen for one EEMBC
>>>>> benchmark (can you provide a testcase?)?
>>>>
>>>> It does not happen because the scheduler by itself does not do such
>>>> specific reordering.  That said, it is easy to imagine cases where this
>>>> patch will make things worse rather than better.
>>>>
>>>> Igor, why not try different, subtler mechanisms like adjust_priority,
>>>> which gets called when an insn is added to the ready list?  E.g.
>>>> increase the producer's priority.
>>>>
>>>> The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
>>>> when you have more than one imul in the ready list?  Don't you want to
>>>> bump the priority of the other imul found?
>>>
>>> Could you provide some examples where this patch would harm performance?
>>
>> I thought of cases where the other ready insns could fill up the hole,
>> and that would be more beneficial because e.g. they would be on more
>> critical paths than the producer of your second imul.  I don't know
>> enough about Atom to give an example -- maybe some long divisions?
>>
>>> Sched_reorder was chosen since it is used in other ports and looks most
>>> suitable for such a case, e.g. it provides access to the whole ready
>>> list.
>>> BTW, just increasing the producer's priority seems riskier
>>> performance-wise -- we could incorrectly start delaying some other
>>> instructions.
>>
>> Yes, but exactly because of the above example you can start incorrectly
>> delaying other insns, too, as you force the insn to be first in the
>> list.  Bumping the priority instead leaves the scheduler's sorting
>> heuristics in place and actually lowers that risk.
>>
>>> I thought the ready list doesn't contain DEBUG_INSNs...  Is that so?
>>> If it does contain them, this could be added easily.
>>
>> It does, but I'm not sure whether the sched_reorder hook gets them or
>> they are immediately removed -- I saw similar checks in one of the
>> targets' hooks.
>
> Done with DEBUG_INSN; the one-imul limit was also removed.  Patch
> attached.
>
>> Anyway, my main thought was that it is better to test on more benchmarks
>> to alleviate the above concerns, so as long as the i386 maintainers are
>> happy, I don't see major problems here.  A good idea could be to
>> generalize the patch to handle other long-latency insns as second
>> consumers, not only imuls, if this is relevant for Atom.
>
> Yes, generalizing this approach is in the plans.  According to the Atom
> software optimization guide there is still some headroom here.  As for
> trying more benchmarks -- this particular case is indeed quite rare.  I
> attached an example where the patch helps to group imuls in pairs, which
> is profitable for Atom.  Such code is not very common, but hopefully
> this approach can help avoid this and other glass jaws.
BTW, this patch also helps some EEMBC tests when -funroll-loops is
specified.  So, any feedback from the i386 maintainers about this? :)

The ChangeLog slightly changed:

2012-04-10  Yuri Rumyantsev  <yuri.s.rumyant...@intel.com>

        * config/i386/i386.c (x86_sched_reorder): New function to take
        advantage of Atom pipelined IMUL execution.

>
>> Andrey
>>
>>> The case with more than one imul in the ready list wasn't considered
>>> just because the goal was to catch the particular case where there is a
>>> risk of getting the following picture: "imul-producer-imul", which is
>>> less effective than "producer-imul-imul" for Atom.
>>>
>>>> Andrey
>>>>
>>>>>> And the MD allows us only to prefer scheduling of successive IMUL
>>>>>> instructions, i.e. if an IMUL was just scheduled and the ready list
>>>>>> contains another IMUL instruction, then it will be chosen as the
>>>>>> best candidate for scheduling.
>>>>>>
>>>>>>> at least from my very limited guessing of what the above does.  So,
>>>>>>> did you analyze why the scheduler does not properly schedule things
>>>>>>> for you?
>>>>>>>
>>>>>>> Richard.
>>>>>>>
>>>>>>>> From reading the patch, I could not understand the link between
>>>>>>>> the pipeline behavior and what the patch appears to do.
>>>>>>>>
>>>>>>>> Hope that helps.
>>>>>>>>
>>>>>>>> Alexander
>>>
>>> Thanks,
>>> Igor