On Mon, 8 Mar 2021, Jakub Jelinek wrote:
> On Mon, Mar 08, 2021 at 12:04:22PM +0100, Richard Biener wrote:
> > +;; Further split pinsrq variants of vec_concatv2di to hide the latency
> > +;; of the GPR->XMM transition(s).
> > +(define_peephole2
> > + [(match_scratch:DI 3 "Yv")
> > + (set (match_operand:V2DI 0 "sse_reg_operand")
> > + (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
> > + (match_operand:DI 2 "nonimmediate_gr_operand")))]
> > + "TARGET_64BIT && TARGET_SSE4_1
> > + && !optimize_insn_for_size_p ()"
> > + [(set (match_dup 3)
> > + (match_dup 2))
> > + (set (match_dup 0)
> > + (vec_concat:V2DI (match_dup 1)
> > + (match_dup 3)))])
>
> Do we really want to do it for all vpinsrqs and not just those where
> operands[1] is set from a nonimmediate_gr_operand a few instructions
> earlier (or perhaps e.g. other insertions from GPRs)?
> I mean, whether this is a win should depend on the latency of the
> operands[1] setter if it is not too far from the vec_concat: if it has low
> latency, this will only grow code without benefit; if it has high latency
> it indeed could perform the GPR -> XMM move concurrently with the previous
> operation.
In principle even a single high-latency op (GPR or MEM) is worth splitting
if scheduling can move it far enough.  What "enough" means of course depends
on the CPU and its OOO issue capabilities.  It's not exactly clear
what the uops are, but I suspect it's a MEM load or a GPR->xmm move, which
would mean the dependency on the first slow op should be mitigated by
OOO issuing as well.
OK, so I threw the following at IACA:
.L3:
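# IACA analysis start marker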
movl $111, %ebx
.byte 0x64, 0x67, 0x90
# movq %rdi, %xmm0
# vpinsrq $1, %rax, %xmm0, %xmm0
# movq %rdi, %xmm1
# vpinsrq $1, %rax, %xmm1, %xmm1
movq %rdi, %xmm0
movq %rax, %xmm3
vpunpcklqdq %xmm3, %xmm0, %xmm0
movq %rdi, %xmm1
movq %rax, %xmm4
vpunpcklqdq %xmm4, %xmm1, %xmm1
vpaddq %xmm0, %xmm2, %xmm2
vpaddq %xmm1, %xmm2, %xmm2
addl $1, %edx
cmpl %edx, %eax
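# IACA analysis end marker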
movl $222, %ebx
.byte 0x64, 0x67, 0x90
jne .L3
and the commented vs. the uncommented sequence makes no difference in its
analysis of the loop throughput (but there's no work to be issued
between the slow moves and the unpack) - IACA also doesn't seem to model
OOO execution here (the first vpaddq is only issued after the 2nd unpck).
Going back to the Botan case, there's no speed difference between the
following, which has both the spilling and the vpinsrq removed:
vmovdqu (%rsi), %xmm4
movq 8(%rsi), %rdx
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq (%rsi), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm4, %xmm1
shrq $63, %rdx
vmovq %rdx, %xmm5
vpunpcklqdq %xmm0, %xmm5, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
to the variant with just the spilling removed:
vmovdqu (%rsi), %xmm4
movq 8(%rsi), %rdx
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq (%rsi), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm4, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
so it looks like I was barking up the wrong tree here and OOO issuing
works fine (the patch might still make a difference on in-order
microarchitectures or in situations not tested).
Thus I'll drop it for now.
That said, the spilling issue is way harder to address, I fear.  For
reference, this was the original:
vmovdqu (%rsi), %xmm4
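# %xmm4 is spilled to 16(%rsp) and the scalar parts are reloaded from the spill slot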
vmovdqa %xmm4, 16(%rsp)
movq 24(%rsp), %rdx
vmovdqa 16(%rsp), %xmm5
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq 16(%rsp), %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm5, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
and the important part is to do the loads into %rdx/%rdi via
(%rsi) and not via the spill slot.  Performing more "clever"
reloads from %xmm4, like
vmovdqu (%rsi), %xmm4
vmovhlps %xmm4, %xmm4, %xmm5
movq %xmm5, %rdx
vmovdqa 16(%rsp), %xmm5
shrq $63, %rdx
imulq $135, %rdx, %rdi
movq %xmm4, %rdx
vmovq %rdi, %xmm0
vpsllq $1, %xmm4, %xmm1
shrq $63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53
only improves things marginally.
Richard.
> Hardcoding the operands[1] setter in the peephole2 would mean we couldn't
> match some small number of unrelated insns in between, but perhaps the
> peephole2 condition could just call a function that walks the IL backward a
> little and checks where the setter is and what latency it has?
>
> Jakub
>
>
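For the record, a condition helper along those lines could look roughly
like the untested sketch below.  The helper name, the search depth and the
cost cutoff are placeholders; prev_nonnote_nondebug_insn, single_set,
reg_set_p and insn_cost are the existing RTL helpers it would lean on, and
the peephole2 condition would pass in peep2_next_insn (0) and operands[1].

/* Walk a few insns back from INSN looking for the single_set that
   defines REG (operands[1] of the vec_concat) and report whether that
   setter is expensive enough that splitting out the GPR->XMM move can
   hide latency.  */
static bool
slow_setter_before_p (rtx_insn *insn, rtx reg)
{
  for (int i = 0; i < 4; ++i)
    {
      insn = prev_nonnote_nondebug_insn (insn);
      if (!insn || !NONJUMP_INSN_P (insn))
        return false;
      rtx set = single_set (insn);
      if (set && rtx_equal_p (SET_DEST (set), reg))
        /* Found the setter; only allow the split when it is costly.  */
        return (insn_cost (insn, optimize_insn_for_speed_p ())
                > COSTS_N_INSNS (1));
      /* Give up if REG is modified in some other way in between.  */
      if (reg_set_p (reg, insn))
        return false;
    }
  return false;
}

The peephole2 condition would then read something like
  "TARGET_64BIT && TARGET_SSE4_1
   && !optimize_insn_for_size_p ()
   && slow_setter_before_p (peep2_next_insn (0), operands[1])"
but that would of course need benchmarking on its own.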
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)