https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445
--- Comment #6 from Thiago Macieira <thiago at kde dot org> ---
(In reply to Jakub Jelinek from comment #4)
>         vmovupd (%rsi,%rax), %zmm1{%k1}{z}
>         addq    %rdx, %rax
>         vmovupd (%rax), %zmm2{%k1}{z}
>         vfmadd132pd     %zmm0, %zmm2, %zmm1
>         vmovupd %zmm1, (%rax){%k1}
> isn't optimal btw, it would be nice if we could merge that masking into the
> vfmadd132pd instruction, like:
>         vmovupd (%rsi,%rax), %zmm1{%k1}{z}
>         addq    %rdx, %rax
>         vfmadd132pd     (%rax), %zmm2, %zmm1%{k1}{z}
>         vmovupd %zmm1, (%rax){%k1}
> but not really sure how to achieve that.

It would be nice. It would be even nicer not to have that "addq".

That's actually what ICC generates (click on the godbolt link and change one of
the compilers to ICC 19):

..B1.3:                         # Preds ..B1.3 ..B1.2
        cmpq      %rax, %r8                                     #12.13
        cmova     %r10d, %r9d                                   #12.13
        kmovw     %r9d, %k1                                     #13.20
        vmovupd   (%r8,%rsi), %zmm1{%k1}{z}                     #13.20
        vfmadd213pd (%r8,%rdx), %zmm0, %zmm1{%k1}{z}            #15.20
        vmovupd   %zmm1, (%r8,%rdx){%k1}                        #16.9
        addq      $64, %r8                                      #10.48
        cmpq      %rcx, %r8                                     #10.32
        jb        ..B1.3        # Prob 82%                      #10.32

There's one more simplification here: ICC lacks the movzbl instruction which
GCC inserted but is completely superfluous. First, we've already calculated the
proper 32-bit pattern and stored it in %r9d, there was no need to zero extend
it. Second, when operating on 512-bit packed doubles, there are 8 lanes, so
only the low 8 bits of the mask register will be considered in the first place.
(Arguably, the intrinsic should have used __mmask8, but that wasn't added until
AVX512DQ and this is F)

That reduces the number of instructions and will save you a couple of uops per
loop. Depending on how long your loop is, it may help you fit in the DSB and
help the Loop Stream Detector. I'm not at all knowledgeable about those
details, so I'll just link to
https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of#answer-39940932.
For this particular loop, if run long enough, I don't think there's any effect, but this is an area for improvement for longer loops. The number of instructions also matters for short-lived loops, a case I hit often when using SIMD on strings (tens of bytes long, so the loop runs only once or twice).
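
To illustrate the point about the mask width, here is a rough scalar model in C of one masked loop iteration (zero-masked load, FMA, masked store). It is only a sketch and is not taken from the bug's test case: the names masked_fma_step and tail_mask are made up for this example, and the tail-mask formula is an assumption about how such a loop would typically build its final mask. It uses no intrinsics, so it says nothing about what either compiler emits; it only shows that with 8 lanes the upper 8 bits of a 16-bit mask cannot change the result.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 }; /* a 512-bit register holds 8 doubles */

/*
 * Hypothetical scalar model of one loop iteration:
 *   zero-masked load of x, FMA with y, masked store back to y.
 * Lanes whose mask bit is clear are never stored, so y keeps its old
 * value there. A real FMA rounds once; a * x + y may round twice.
 */
static void masked_fma_step(uint16_t mask, double a,
                            const double *x, double *y)
{
    for (int lane = 0; lane < LANES; ++lane) {
        /* Only bits 0..7 are ever inspected; bits 8..15 are ignored. */
        if ((mask >> lane) & 1u)
            y[lane] = a * x[lane] + y[lane];
    }
}

/*
 * Hypothetical helper (an assumption, not from the bug report): mask
 * for a final partial iteration covering n % 8 elements. It returns 0
 * when n is a multiple of 8, i.e. when there is no partial iteration.
 */
static uint16_t tail_mask(size_t n)
{
    return (uint16_t)((1u << (n % LANES)) - 1u);
}
```

Calling masked_fma_step with 0x0003 and with 0xFF03 updates the same two lanes, which is why zero-extending the mask from 8 bits before the kmovw buys nothing here.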