On Mon, Aug 5, 2019 at 11:13 AM Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Uros Bizjak <ubiz...@gmail.com> writes:
> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguent...@suse.de> wrote:
> >>
> >> On Thu, 1 Aug 2019, Uros Bizjak wrote:
> >>
> >> > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguent...@suse.de> wrote:
> >> >
> >> >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> >> >>>> necessary even when going the STV route. The actual regression
> >> >>>> for the testcase could also be solved by turning the smaxsi3
> >> >>>> back into a compare and jump rather than a conditional move sequence.
> >> >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> >> >>>> after pass_split_after_reload and I'm not sure we can split
> >> >>>> as late as pass_split_before_sched2 (there's also a split _after_
> >> >>>> sched2 on x86 it seems).
> >> >>>>
> >> >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
> >> >>>> case STV doesn't end up doing any transform?
> >> >>>
> >> >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> >> >>> the insn back to compare+cmove.
> >> >>
> >> >> OK, that would work. But there's no way to force a jumpy sequence then
> >> >> which we know is faster than compare+cmove because later RTL
> >> >> if-conversion passes happily re-discover the smax (or conditional move)
> >> >> sequence.
> >> >>
> >> >>> However, considering the SImode move
> >> >>> from/to int/xmm register is relatively cheap, the cost function should
> >> >>> be tuned so that STV always converts smaxsi3 pattern.
> >> >>
> >> >> Note that on both Zen and even more so bdverN the int/xmm transition
> >> >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> >> >> sequence... (for the loop in hmmer which is the only one I see
> >> >> any effect of any of my patches). So identifying chains that
> >> >> start/end in memory is important for cost reasons.
> >> >
> >> > Please note that the cost function also considers the cost of move
> >> > from/to xmm. So, the cost of the whole chain would disable the
> >> > transformation.
> >> >
> >> >> So I think the splitting has to happen after the last if-conversion
> >> >> pass (and thus we may need to allocate a scratch register for this
> >> >> purpose?)
> >> >
> >> > I really hope that the underlying issue will be solved by a machine
> >> > dependent pass inserted somewhere after the pre-reload split. This
> >> > way, we can split unconverted smax to the cmove, and this later pass
> >> > would handle jcc and cmove instructions. Until then... yes your
> >> > proposed approach is one of the ways to avoid unwanted if-conversion,
> >> > although sometimes we would like to split to cmove instead.
> >>
> >> So the following makes STV also consider SImode chains, re-using the
> >> DImode chain code. I've kept a simple incomplete smaxsi3 pattern
> >> and also did not alter the {SI,DI}mode chain cost function - it's
> >> quite off for TARGET_64BIT. With this I get the expected conversion
> >> for the testcase derived from hmmer.
> >>
> >> No further testing so far.
> >>
> >> Is it OK to re-use the DImode chain code this way? I'll clean things
> >> up some more of course.
> >
> > Yes, the approach looks OK to me. It makes chain building mode
> > agnostic, and the chain building can be used for
> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > minmax and surrounding SImode operations)
> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > DImode operations)
> >
> >> Still need help with the actual patterns for minmax and how the splitters
> >> should look like.
> >
> > Please look at the attached patch. Maybe we can add memory_operand as
> > operand 1 and operand 2 predicate, but let's keep things simple for
> > now.
> >
> > Uros.
> >
> > Index: i386.md
> > ===================================================================
> > --- i386.md (revision 274008)
> > +++ i386.md (working copy)
> > @@ -17721,6 +17721,27 @@
> >      std::swap (operands[4], operands[5]);
> >  })
> >
> > +;; min/max patterns
> > +
> > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
> > +
> > +(define_insn_and_split "<code><mode>3"
> > +  [(set (match_operand:SWI48 0 "register_operand")
> > +        (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> > +                       (match_operand:SWI48 2 "register_operand")))
> > +   (clobber (reg:CC FLAGS_REG))]
> > +  "TARGET_STV && TARGET_SSE4_1
> > +   && can_create_pseudo_p ()"
> > +  "#"
> > +  "&& 1"
> > +  [(set (reg:CCGC FLAGS_REG)
> > +        (compare:CCGC (match_dup 1)(match_dup 2)))
> > +   (set (match_dup 0)
> > +        (if_then_else:SWI48
> > +          (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
> > +          (match_dup 1)
> > +          (match_dup 2)))])
> > +
>
> The pattern could in theory be matched after the last pre-RA split pass
> has run, so I think the pattern still needs to have constraints and be
> matchable even without can_create_pseudo_p. It looks like the split
> above should work post-RA.
>
> A bit pedantic, because the pattern's probably fine in practice...

Currently, all unmatched STV patterns are split before reload, and
there have been no problems. If the pattern matches after the last
pre-RA split pass, then the post-reload splitter will fail, since
can_create_pseudo_p also applies to the part that splits the insn.

In any case, thanks for the heads-up, hopefully we didn't assume
something that doesn't hold.

Thanks,
Uros.

> Thanks,
> Richard
>
> >  ;; Conditional addition patterns
> >  (define_expand "add<mode>cc"
> >    [(match_operand:SWI 0 "register_operand")
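
For readers following the split in the quoted pattern, here is a rough C
model of what the two replacement insns compute. It is only a sketch
under stated assumptions: compare_model, smax_split_model,
smin_split_model and smaxmin_self_check are hypothetical names that do
not appear in the patch, only the SImode case is modelled, and which
instructions the compiler finally emits (cmov or a jump) is not implied
by this code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the flags set by (compare:CCGC op1 op2):
   negative, zero or positive, like a signed three-way compare.  */
static int
compare_model (int32_t op1, int32_t op2)
{
  return (op1 > op2) - (op1 < op2);
}

/* smax maps to "ge": (if_then_else (ge flags 0) op1 op2).  */
int32_t
smax_split_model (int32_t op1, int32_t op2)
{
  int flags = compare_model (op1, op2);
  return flags >= 0 ? op1 : op2;
}

/* smin maps to "le": (if_then_else (le flags 0) op1 op2).  */
int32_t
smin_split_model (int32_t op1, int32_t op2)
{
  int flags = compare_model (op1, op2);
  return flags <= 0 ? op1 : op2;
}

/* Small self-check of the model against plain min/max results.  */
void
smaxmin_self_check (void)
{
  assert (smax_split_model (3, 5) == 5);
  assert (smin_split_model (3, 5) == 3);
  assert (smax_split_model (-1, -7) == -1);
  assert (smin_split_model (-1, -7) == -7);
}
```

Calling smaxmin_self_check () from a test driver exercises both mappings
from the smaxmin_rel attribute; the model says nothing about when the
split may run, which is the can_create_pseudo_p question discussed above.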