On Mon, Mar 08, 2021 at 12:04:22PM +0100, Richard Biener wrote:
> +;; Further split pinsrq variants of vec_concatv2di to hide the latency
> +;; the GPR->XMM transition(s).
> +(define_peephole2
> + [(match_scratch:DI 3 "Yv")
> + (set (match_operand:V2DI 0 "sse_reg_operand")
> + (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
> + (match_operand:DI 2 "nonimmediate_gr_operand")))]
> + "TARGET_64BIT && TARGET_SSE4_1
> + && !optimize_insn_for_size_p ()"
> + [(set (match_dup 3)
> + (match_dup 2))
> + (set (match_dup 0)
> + (vec_concat:V2DI (match_dup 1)
> + (match_dup 3)))])
Do we really want to do it for all vpinsrqs and not just those where
operands[1] is set from a nonimmediate_gr_operand a few instructions
earlier (or perhaps e.g. other insertions from GPRs)?
I mean, whether this is a win should depend on the latency of the
operands[1] setter when it is not too far from the vec_concat.  If it has
low latency, this will only grow code without benefit; if it has high
latency, the split could indeed perform the GPR -> XMM move concurrently
with the previous operation.
Hardcoding the operands[1] setter in the peephole2 would mean we couldn't
match some small number of unrelated insns in between, but perhaps the
peephole2 condition could just call a function that walks the IL backward a
little and checks where the setter is and what latency it has?
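
To make the idea concrete, here is a rough, self-contained sketch of such a
backward walk.  It is only a toy model: ToyInsn, setter_latency_worth_split,
max_walk and min_latency are all made-up names for illustration, not existing
GCC interfaces, and a real version would have to work on the actual insn
stream and take latencies from the scheduler description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an insn; this is NOT GCC's rtx_insn.
struct ToyInsn {
  int dest_reg;  // register written by this insn, -1 if it writes none
  int latency;   // assumed result latency of this insn, in cycles
};

// Walk backward from position POS (the vec_concat) over at most MAX_WALK
// insns looking for the insn that sets REG.  Return true only if that
// setter is found within the window and its latency is at least
// MIN_LATENCY, i.e. when a separate GPR -> XMM move could plausibly
// overlap with it.
bool setter_latency_worth_split(const std::vector<ToyInsn>& insns,
                                std::size_t pos, int reg,
                                std::size_t max_walk, int min_latency)
{
  assert(pos <= insns.size());
  for (std::size_t steps = 0; steps < max_walk && pos > 0; ++steps) {
    --pos;
    if (insns[pos].dest_reg == reg)
      return insns[pos].latency >= min_latency;
  }
  // Setter not found nearby: assume operands[1] is already available,
  // so splitting would only grow code.
  return false;
}
```

The peephole2 condition would then call something of this shape with the
position of the vec_concat and the register of operands[1], and only split
when it returns true.  The window size and the latency threshold are tuning
knobs I have not tried to pick values for.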
Jakub