Ping. Could the x86 maintainer(s) look at these changes?

Thanks,
Igor
On Fri, Apr 20, 2012 at 4:04 PM, Igor Zamyatin <izamya...@gmail.com> wrote:
> On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin <izamya...@gmail.com> wrote:
>> On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev <a...@ispras.ru> wrote:
>>> On 13.04.2012 14:18, Igor Zamyatin wrote:
>>>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev <a...@ispras.ru> wrote:
>>>>> On 12.04.2012 16:38, Richard Guenther wrote:
>>>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin <izamya...@gmail.com> wrote:
>>>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther <richard.guent...@gmail.com> wrote:
>>>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov <amona...@ispras.ru> wrote:
>>>>>>>>>> Can Atom execute two IMULs in parallel? Or what exactly is the
>>>>>>>>>> pipeline behavior?
>>>>>>>>>
>>>>>>>>> As I understand from Intel's optimization reference manual, the
>>>>>>>>> behavior is as follows: if the instruction immediately following
>>>>>>>>> IMUL has a shorter latency, execution stalls for 4 cycles (which
>>>>>>>>> is IMUL's latency); otherwise, an instruction with a latency of
>>>>>>>>> 4 or more cycles can be issued after IMUL without a stall. In
>>>>>>>>> other words, IMUL is pipelined with respect to other long-latency
>>>>>>>>> instructions, but not to short-latency instructions.
>>>>>>>>
>>>>>>>> It seems to be modeled in the pipeline description though:
>>>>>>>>
>>>>>>>> ;;; imul insn has 5 cycles latency
>>>>>>>> (define_reservation "atom-imul-32"
>>>>>>>>                     "atom-imul-1, atom-imul-2, atom-imul-3,
>>>>>>>>                      atom-imul-4, atom-port-0")
>>>>>>>>
>>>>>>>> ;;; imul instruction excludes other non-FP instructions.
>>>>>>>> (exclusion_set "atom-eu-0, atom-eu-1"
>>>>>>>>                "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")
>>>>>>>>
>>>>>>> The main idea is quite simple:
>>>>>>>
>>>>>>> If we are about to schedule an IMUL instruction (it is at the top
>>>>>>> of the ready list), we try to find the producer of another
>>>>>>> (independent) IMUL instruction that is also in the ready list. The
>>>>>>> goal is to schedule that producer first, so that another IMUL
>>>>>>> appears in the ready list and we get two successive IMUL
>>>>>>> instructions scheduled.
>>>>>>
>>>>>> Why does that not happen without your patch? Does it never happen
>>>>>> without your patch, or does it merely not happen for one EEMBC
>>>>>> benchmark (can you provide a testcase?)?
>>>>>
>>>>> It does not happen because the scheduler by itself does not do such
>>>>> specific reordering. That said, it is easy to imagine cases where
>>>>> this patch will make things worse rather than better.
>>>>>
>>>>> Igor, why not try different, subtler mechanisms like
>>>>> adjust_priority, which gets called when an insn is added to the
>>>>> ready list? E.g., increase the producer's priority.
>>>>>
>>>>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail
>>>>> out when you have more than one imul in the ready list? Don't you
>>>>> want to bump the priority of the other imul found?
>>>>
>>>> Could you provide some examples where this patch would harm
>>>> performance?
>>>
>>> I was thinking of cases where the other ready insns could fill up the
>>> hole, and that would be more beneficial because, e.g., they would be
>>> on more critical paths than the producer of your second imul.
>>> I don't know Atom well enough to give an example -- maybe some long
>>> divisions?
>>>
>>>> Sched_reorder was chosen since it is used in other ports and looks
>>>> most suitable for such a case, e.g. it provides access to the whole
>>>> ready list. BTW, just increasing the producer's priority seems
>>>> riskier performance-wise - we could incorrectly start delaying some
>>>> instructions.
>>>
>>> Yes, but exactly because of the above example you can start
>>> incorrectly delaying other insns too, as you force the insn to be
>>> first in the list, while bumping the priority still leaves the
>>> scheduler's sorting heuristics in place and actually lowers that risk.
>>>
>>>> I thought the ready list doesn't contain DEBUG_INSNs... Is that so?
>>>> If it does contain them, this could be added easily.
>>>
>>> It does, but I'm not sure whether the sched_reorder hook gets them or
>>> they are removed beforehand -- I saw similar checks in one of the
>>> targets' hooks.
>>
>> Done with DEBUG_INSN; the one-imul limit was also removed. Patch
>> attached.
>>
>>> Anyway, my main thought was that it is better to test on more
>>> benchmarks to alleviate the above concerns, so as long as the i386
>>> maintainers are happy, I don't see major problems here. A good idea
>>> would be to generalize the patch to handle other long-latency insns as
>>> second consumers, not only imuls, if that is relevant for Atom.
>>
>> Yes, generalizing this approach is in the plans. According to the Atom
>> software optimization guide there is still some headroom here. As for
>> trying more benchmarks - this particular case is indeed quite rare. I
>> attached an example where the patch helps to group imuls in pairs,
>> which is profitable for Atom. Such code is not very common, but
>> hopefully this approach can help avoid this and other glass jaws.
>
> BTW, this patch also helps some EEMBC tests when -funroll-loops is
> specified. So, any feedback from the i386 maintainers about this? :)
>
> The ChangeLog has changed slightly:
>
> 2012-04-10  Yuri Rumyantsev  <yuri.s.rumyant...@intel.com>
>
>         * config/i386/i386.c (x86_sched_reorder): New function to take
>         advantage of Atom's pipelined IMUL execution.
>
>>> Andrey
>>>
>>>> The case with more than one imul in the ready list wasn't considered
>>>> simply because the goal was to catch the particular case where there
>>>> is a risk of getting the pattern "imul-producer-imul", which is less
>>>> effective than "producer-imul-imul" for Atom.
>>>>
>>>>> Andrey
>>>>>
>>>>>>> And the MD only allows us to prefer scheduling of successive IMUL
>>>>>>> instructions, i.e. if an IMUL was just scheduled and the ready
>>>>>>> list contains another IMUL instruction, then it will be chosen as
>>>>>>> the best candidate for scheduling.
>>>>>>>
>>>>>>>> at least from my very limited guessing of what the above does.
>>>>>>>> So, did you analyze why the scheduler does not properly schedule
>>>>>>>> things for you?
>>>>>>>>
>>>>>>>> Richard.
>>>>>>>>
>>>>>>>>> From reading the patch, I could not understand the link between
>>>>>>>>> the pipeline behavior and what the patch appears to do.
>>>>>>>>>
>>>>>>>>> Hope that helps.
>>>>>>>>>
>>>>>>>>> Alexander
>>>>
>>>> Thanks,
>>>> Igor
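For completeness, here is a minimal sketch of the reordering idea
discussed above. It is only an illustration, not the attached patch: the
function name, the Atom tuning test and the exact dependence walk are
assumptions, and it relies on GCC 4.7-era scheduler interfaces (i386.c
would need the dependence accessors from sched-int.h).

/* Illustrative sketch only -- not the attached patch.  When the insn
   about to issue (the last element of READY) is an imul, look for
   another ready insn that feeds a second imul and rotate it to the top
   of the ready list, so that the two imuls can issue back to back on
   Atom.  Requires the dependence accessors from sched-int.h.  */

static int
ix86_sched_reorder_sketch (FILE *dump ATTRIBUTE_UNUSED,
                           int sched_verbose ATTRIBUTE_UNUSED,
                           rtx *ready, int *pn_ready,
                           int clock_var ATTRIBUTE_UNUSED)
{
  int n_ready = *pn_ready;
  int issue_rate = ix86_issue_rate ();
  rtx top;
  int i;

  if (ix86_tune != PROCESSOR_ATOM || n_ready < 2)
    return issue_rate;

  /* ready[n_ready - 1] is the insn the scheduler will issue next.  */
  top = ready[n_ready - 1];
  if (!NONDEBUG_INSN_P (top)
      || recog_memoized (top) < 0
      || get_attr_type (top) != TYPE_IMUL)
    return issue_rate;

  /* Look for a ready insn one of whose forward-dependent consumers is
     another imul; scheduling it now lets that imul become ready right
     after TOP issues.  */
  for (i = n_ready - 2; i >= 0; i--)
    {
      sd_iterator_def sd_it;
      dep_t dep;
      rtx insn = ready[i];

      if (!NONDEBUG_INSN_P (insn))
        continue;

      FOR_EACH_DEP (insn, SD_LIST_FORW, sd_it, dep)
        {
          rtx con = DEP_CON (dep);

          if (con != top
              && NONDEBUG_INSN_P (con)
              && recog_memoized (con) >= 0
              && get_attr_type (con) == TYPE_IMUL)
            {
              /* Rotate the producer to the top of the ready list:
                 "producer-imul-imul" instead of "imul-producer-imul".  */
              int j;

              for (j = i; j < n_ready - 1; j++)
                ready[j] = ready[j + 1];
              ready[n_ready - 1] = insn;
              return issue_rate;
            }
        }
    }

  return issue_rate;
}

Such a hook would be wired up with the usual #undef TARGET_SCHED_REORDER
/ #define TARGET_SCHED_REORDER pair; the return value is the number of
insns the scheduler may issue in this cycle, here simply the issue rate,
as in other ports.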
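For comparison, the subtler adjust_priority mechanism Andrey mentions
could be sketched roughly as follows (again purely illustrative; the
function name and the size of the priority bump are assumptions):

/* Illustrative sketch of the adjust_priority alternative -- also not
   part of the patch.  Instead of forcing a reorder, nudge the priority
   of insns that feed an imul and let the normal sorting heuristics
   decide.  Would be registered via TARGET_SCHED_ADJUST_PRIORITY.  */

static int
ix86_adjust_priority_sketch (rtx insn, int priority)
{
  sd_iterator_def sd_it;
  dep_t dep;

  if (ix86_tune != PROCESSOR_ATOM || !NONDEBUG_INSN_P (insn))
    return priority;

  FOR_EACH_DEP (insn, SD_LIST_FORW, sd_it, dep)
    {
      rtx con = DEP_CON (dep);

      if (NONDEBUG_INSN_P (con)
          && recog_memoized (con) >= 0
          && get_attr_type (con) == TYPE_IMUL)
        /* A small bump keeps the other heuristics in charge while
           making producers of imuls slightly more attractive.  */
        return priority + 1;
    }

  return priority;
}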