https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445

--- Comment #6 from Thiago Macieira <thiago at kde dot org> ---
(In reply to Jakub Jelinek from comment #4)
>         vmovupd (%rsi,%rax), %zmm1{%k1}{z}
>         addq    %rdx, %rax
>         vmovupd (%rax), %zmm2{%k1}{z}
>         vfmadd132pd     %zmm0, %zmm2, %zmm1
>         vmovupd %zmm1, (%rax){%k1}
> isn't optimal btw, it would be nice if we could merge that masking into the
> vfmadd132pd instruction, like:
>         vmovupd (%rsi,%rax), %zmm1{%k1}{z}
>         addq    %rdx, %rax
>         vfmadd132pd     (%rax), %zmm2, %zmm1{%k1}{z}
>         vmovupd %zmm1, (%rax){%k1}
> but not really sure how to achieve that.

It would be nice. It would be even nicer not to have that "addq" at all; ICC
avoids it by using indexed addressing for both memory operands (click on the
godbolt link and change one of the compilers to ICC 19):

..B1.3:                         # Preds ..B1.3 ..B1.2
        cmpq      %rax, %r8                                     #12.13
        cmova     %r10d, %r9d                                   #12.13
        kmovw     %r9d, %k1                                     #13.20
        vmovupd   (%r8,%rsi), %zmm1{%k1}{z}                     #13.20
        vfmadd213pd (%r8,%rdx), %zmm0, %zmm1{%k1}{z}            #15.20
        vmovupd   %zmm1, (%r8,%rdx){%k1}                        #16.9
        addq      $64, %r8                                      #10.48
        cmpq      %rcx, %r8                                     #10.32
        jb        ..B1.3        # Prob 82%                      #10.32
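
For context, here's a minimal C sketch of the kind of kernel both compilers are
presumably compiling. It is reconstructed from the asm above, not the exact
source behind the godbolt link; the function name and the mask computation are
my own:

#include <immintrin.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i]; the final partial vector is handled with a write
   mask instead of a scalar tail loop.  Hypothetical reconstruction. */
void daxpy512(size_t n, double a, const double *x, double *y)
{
    const __m512d va = _mm512_set1_pd(a);
    for (size_t i = 0; i < n; i += 8) {
        size_t remaining = n - i;
        /* All-ones mask except on the last, possibly partial, iteration. */
        __mmask8 k = remaining >= 8 ? 0xff : (__mmask8) ((1u << remaining) - 1);
        __m512d vx = _mm512_maskz_loadu_pd(k, x + i);  /* {z}-masked load */
        __m512d vy = _mm512_maskz_loadu_pd(k, y + i);  /* {z}-masked load */
        __m512d r  = _mm512_fmadd_pd(va, vx, vy);      /* unmasked FMA    */
        _mm512_mask_storeu_pd(y + i, k, r);            /* masked store    */
    }
}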

There's one more simplification here: ICC omits the movzbl instruction that GCC
inserts, which is completely superfluous. First, the proper 32-bit pattern has
already been calculated and stored in %r9d, so there is no need to zero-extend
it. Second, when operating on 512-bit packed doubles there are only 8 lanes, so
only the low 8 bits of the mask register are considered in the first place.
(Arguably, the intrinsic should have used __mmask8, but that wasn't added until
AVX512DQ and this intrinsic is plain AVX512F.)
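
On Jakub's point in comment #4 about folding the masking into the vfmadd: at
the source level the zero-masking can also be requested on the FMA intrinsic
itself, which is plain AVX512F. A hedged variant of the loop body sketched
after the ICC listing (whether GCC then emits the merged vfmadd...{%k1}{z}
from it is another matter):

        __m512d vx = _mm512_maskz_loadu_pd(k, x + i);
        __m512d vy = _mm512_maskz_loadu_pd(k, y + i);
        /* Zero-masking requested on the FMA itself; a compiler may then fold
           the masked load of y into the FMA's memory operand, as ICC's output
           above does (masked-off elements are fault-suppressed). */
        __m512d r  = _mm512_maskz_fmadd_pd(k, va, vx, vy);
        _mm512_mask_storeu_pd(y + i, k, r);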

ICC's version reduces the instruction count and saves a couple of uops per loop
iteration. Depending on how long the loop body is, that may help it fit in the
DSB (the decoded uop cache) and be handled by the Loop Stream Detector. I'm not
at all knowledgeable about those details, so I'll just link to
https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of#answer-39940932.

For this particular loop, if it runs long enough, I don't think there's any
measurable effect, but it's an area for improvement for larger loop bodies. The
instruction count also matters for short-lived loops, which I hit often when
using SIMD on strings (tens of bytes in length, so the loop runs only once or
twice).
