https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244
Bug ID: 92244 Summary: extra sub inside vectorized loop instead of calculating end-pointer Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- We get a redundant instruction inside the vectorized loop here. But it's not a separate *counter*, it's a duplicate of the tail pointer. It goes away if we find tail with while(*tail++); instead of calculating it from head+length. Only happens with vectorization, not pure scalar (bug 92243 is about the fact that -O3 fails to use bswap as a GP-integer shuffle to auto-vectorize without x86 SSSE3). typedef char swapt; void strrev_explicit(swapt *head, long len) { swapt *tail = head + len - 1; for( ; head < tail; ++head, --tail) { swapt h = *head, t = *tail; *head = t; *tail = h; } } https://godbolt.org/z/wdGv4S compiled with g++ -O3 -march=sandybridge gives us a main loop of ... movq %rcx, %rsi # RSI = RCX before entering the loop addq %rdi, %r8 .L4: vmovdqu (%rcx), %xmm3 # tail load from RCX addq $16, %rax # head subq $16, %rcx # tail subq $16, %rsi # 2nd tail? vmovdqu -16(%rax), %xmm0 vpshufb %xmm2, %xmm3, %xmm1 vmovups %xmm1, -16(%rax) vpshufb %xmm2, %xmm0, %xmm0 vmovups %xmm0, 16(%rsi) # tail store to RSI cmpq %r8, %rax # } while(head != end_head) jne .L4 RSI = RCX before and after the loop. This is obviously pointless. head uses the same register for loads and stores. Then we have bloated fully-unrolled scalar cleanup, instead of using the shuffle control for 8-byte vectors -> movhps. Or scalar bswap. Ideally we'd do something clever at the overlap like one load + shuffle + store, but we might have to load the next vector before storing the current to make this work at the overlap. That would presumably require more special-casing this kind of meet-in-the-middle loop. ---- The implicit-length version doesn't have this extra sub in the main loop. void strrev_implicit(swapt *head) { swapt *tail = head; while(*tail) ++tail; // find the 0 terminator, like head+strlen --tail; // tail points to the last real char for( ; head < tail; ++head, --tail) { swapt h = *head, t = *tail; *head = t; *tail = h; } } .L22: vmovdqu (%rcx), %xmm3 addq $16, %rdx # head subq $16, %rcx # tail vmovdqu -16(%rdx), %xmm0 vpshufb %xmm2, %xmm3, %xmm1 vmovups %xmm1, -16(%rdx) vpshufb %xmm2, %xmm0, %xmm0 vmovups %xmm0, 16(%rcx) cmpq %rsi, %rdx # } while(head != end_head) jne .L22