https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

            Bug ID: 92244
           Summary: extra sub inside vectorized loop instead of
                    calculating end-pointer
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

We get a redundant instruction inside the vectorized loop here.  But it's not a
separate *counter*, it's a duplicate of the tail pointer.

It goes away if we find tail with while(*tail++); instead of calculating it
from head+length.

Only happens with vectorization, not pure scalar (bug 92243 is about the fact
that -O3 fails to use bswap as a GP-integer shuffle to auto-vectorize without
x86 SSSE3).

typedef char swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}
https://godbolt.org/z/wdGv4S

compiled with g++ -O3 -march=sandybridge gives us a main loop of

        ...
        movq    %rcx, %rsi         # RSI = RCX before entering the loop
        addq    %rdi, %r8
.L4:
        vmovdqu (%rcx), %xmm3       # tail load from RCX
        addq    $16, %rax        # head
        subq    $16, %rcx        # tail
        subq    $16, %rsi        # 2nd tail?
        vmovdqu -16(%rax), %xmm0
        vpshufb %xmm2, %xmm3, %xmm1
        vmovups %xmm1, -16(%rax)
        vpshufb %xmm2, %xmm0, %xmm0
        vmovups %xmm0, 16(%rsi)     # tail store to RSI
        cmpq    %r8, %rax           # } while(head != end_head)
        jne     .L4

RSI = RCX before and after the loop.  This is obviously pointless.
head uses the same register for loads and stores.

 Then we have bloated fully-unrolled scalar cleanup, instead of using the
shuffle control for 8-byte vectors -> movhps.  Or scalar bswap.  Ideally we'd
do something clever at the overlap like one load + shuffle + store, but we
might have to load the next vector before storing the current to make this work
at the overlap.  That would presumably require more special-casing this kind of
meet-in-the-middle loop.


----

The implicit-length version doesn't have this extra sub in the main loop.

void strrev_implicit(swapt *head)
{
  swapt *tail = head;
  while(*tail) ++tail;    // find the 0 terminator, like head+strlen
  --tail;                 // tail points to the last real char
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

.L22:
        vmovdqu (%rcx), %xmm3
        addq    $16, %rdx           # head
        subq    $16, %rcx           # tail
        vmovdqu -16(%rdx), %xmm0
        vpshufb %xmm2, %xmm3, %xmm1
        vmovups %xmm1, -16(%rdx)
        vpshufb %xmm2, %xmm0, %xmm0
        vmovups %xmm0, 16(%rcx)
        cmpq    %rsi, %rdx          # } while(head != end_head)
        jne     .L22

Reply via email to