Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Richard Sandiford Mon, 05 Aug 2019 02:14:11 -0700

Uros Bizjak <[email protected]> writes:
> On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <[email protected]> wrote:
>>
>> On Thu, 1 Aug 2019, Uros Bizjak wrote:
>>
>> > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <[email protected]> wrote:
>> >
>> >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
>> >>>> necessary even when going the STV route.  The actual regression
>> >>>> for the testcase could also be solved by turning the smaxsi3
>> >>>> back into a compare and jump rather than a conditional move sequence.
>> >>>> So I wonder how you'd do that given that there's pass_if_after_reload
>> >>>> after pass_split_after_reload and I'm not sure we can split
>> >>>> as late as pass_split_before_sched2 (there's also a split _after_
>> >>>> sched2 on x86 it seems).
>> >>>>
>> >>>> So how would you go about implementing {s,u}{min,max}{si,di}3
>> >>>> for the case where STV doesn't end up doing any transform?
>> >>>
>> >>> If STV doesn't transform the insn, then a pre-reload splitter splits
>> >>> the insn back to compare+cmove.
>> >>
>> >> OK, that would work.  But there's no way to force a jumpy sequence then
>> >> which we know is faster than compare+cmove because later RTL
>> >> if-conversion passes happily re-discover the smax (or conditional move)
>> >> sequence.
>> >>
>> >>> However, considering the SImode move
>> >>> from/to int/xmm register is relatively cheap, the cost function should
>> >>> be tuned so that STV always converts smaxsi3 pattern.
>> >>
>> >> Note that on both Zen and even more so bdverN the int/xmm transition
>> >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
>> >> sequence... (for the loop in hmmer which is the only one I see
>> >> any effect of any of my patches).  So identifying chains that
>> >> start/end in memory is important for cost reasons.
>> >
>> > Please note that the cost function also considers the cost of move
>> > from/to xmm. So, the cost of the whole chain would disable the
>> > transformation.
>> >
>> >> So I think the splitting has to happen after the last if-conversion
>> >> pass (and thus we may need to allocate a scratch register for this
>> >> purpose?)
>> >
>> > I really hope that the underlying issue will be solved by a machine
>> > dependent pass inserted somewhere after the pre-reload split. This
>> > way, we can split unconverted smax to the cmove, and this later pass
>> > would handle jcc and cmove instructions. Until then... yes your
>> > proposed approach is one of the ways to avoid unwanted if-conversion,
>> > although sometimes we would like to split to cmove instead.
>>
>> So the following makes STV also consider SImode chains, re-using the
>> DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
>> and also did not alter the {SI,DI}mode chain cost function - it's
>> quite off for TARGET_64BIT.  With this I get the expected conversion
>> for the testcase derived from hmmer.
>>
>> No further testing so far.
>>
>> Is it OK to re-use the DImode chain code this way?  I'll clean things
>> up some more of course.
>
> Yes, the approach looks OK to me. It makes chain building mode
> agnostic, and the chain building can be used for
> a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> minmax and surrounding SImode operations)
> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> DImode operations)
>
>> Still need help with the actual patterns for minmax and what the splitters
>> should look like.
>
> Please look at the attached patch. Maybe we can add memory_operand as
> operand 1 and operand 2 predicate, but let's keep things simple for
> now.
>
> Uros.
>
> Index: i386.md
> ===================================================================
> --- i386.md   (revision 274008)
> +++ i386.md   (working copy)
> @@ -17721,6 +17721,27 @@
>      std::swap (operands[4], operands[5]);
>  })
>  
> +;; min/max patterns
> +
> +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
> +
> +(define_insn_and_split "<code><mode>3"
> +  [(set (match_operand:SWI48 0 "register_operand")
> +     (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> +                    (match_operand:SWI48 2 "register_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_STV && TARGET_SSE4_1
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (reg:CCGC FLAGS_REG)
> +     (compare:CCGC (match_dup 1)(match_dup 2)))
> +   (set (match_dup 0)
> +     (if_then_else:SWI48
> +       (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
> +       (match_dup 1)
> +       (match_dup 2)))])
> +


The pattern could in theory be matched after the last pre-RA split pass
has run, so I think it still needs to have constraints and to be
matchable even when can_create_pseudo_p () is false.  It looks like the
split above should work post-RA.

A bit pedantic, because the pattern's probably fine in practice...
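
For reference, here is a minimal C sketch of what the compare +
conditional-move sequence produced by the split computes, going by the
smaxmin_rel mapping in the patch (smax -> "ge", smin -> "le").  The
function names are hypothetical and exist only for illustration; this
models the semantics of the two split insns for the SImode and DImode
cases, not the RTL or the register allocation.

```c
#include <assert.h>
#include <stdint.h>

/* smax: compare op1 with op2, then select op1 if "ge", else op2.  */
int32_t smax_si(int32_t op1, int32_t op2) { return op1 >= op2 ? op1 : op2; }
int64_t smax_di(int64_t op1, int64_t op2) { return op1 >= op2 ? op1 : op2; }

/* smin: compare op1 with op2, then select op1 if "le", else op2.  */
int32_t smin_si(int32_t op1, int32_t op2) { return op1 <= op2 ? op1 : op2; }
int64_t smin_di(int64_t op1, int64_t op2) { return op1 <= op2 ? op1 : op2; }

/* Hypothetical self-check: the comparisons are signed, so a negative
   value must never win a max against a positive one.  */
void smaxmin_self_check(void)
{
  assert(smax_si(-1, 1) == 1);
  assert(smin_si(-1, 1) == -1);
  assert(smax_di(INT64_MIN, INT64_MAX) == INT64_MAX);
  assert(smin_di(INT64_MIN, INT64_MAX) == INT64_MIN);
}
```

When both operands are equal, either arm of the select gives the same
value, so using "ge"/"le" rather than "gt"/"lt" makes no difference to
the result.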

Thanks,
Richard

>  ;; Conditional addition patterns
>  (define_expand "add<mode>cc"
>    [(match_operand:SWI 0 "register_operand")
