http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60575
Bug ID: 60575
Summary: inefficient vectorization of compare into bytes on amd64
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jtaylor.debian at googlemail dot com

This code, comparing shorts into chars:

void __attribute__((optimize("O3")))
f(char * a_, short * b_)
{
    char * restrict a = __builtin_assume_aligned(a_, 16);
    short * restrict b = __builtin_assume_aligned(b_, 16);
    for (int i = 0; i < 1024; i++) {
        a[i] = 6 < b[i];
    }
}

vectorizes with gcc 4.8.2 (gcc file.c -c -std=c99) to:

  22: movdqa (%rsi,%rax,2),%xmm0
  27: movdqa 0x10(%rsi,%rax,2),%xmm1
  2d: pcmpgtw %xmm4,%xmm0
  31: pcmpgtw %xmm4,%xmm1
  35: pand %xmm3,%xmm0
  39: pand %xmm3,%xmm1
  3d: movdqa %xmm0,%xmm2
  41: punpcklbw %xmm1,%xmm0
  45: punpckhbw %xmm1,%xmm2
  49: movdqa %xmm0,%xmm1
  4d: punpcklbw %xmm2,%xmm0
  51: punpckhbw %xmm2,%xmm1
  55: movdqa %xmm0,%xmm2
  59: punpcklbw %xmm1,%xmm0
  5d: punpckhbw %xmm1,%xmm2
  61: punpcklbw %xmm2,%xmm0
  65: movdqa %xmm0,(%rdi,%rax,1)
  6a: add $0x10,%rax
  6e: cmp $0x400,%rax
  74: jne 22 <f+0x22>

This is relatively inefficient compared to using pack instructions, which would look roughly like this (unrolled twice):

  b3: movdqa (%rsi,%rax,2),%xmm1
  b8: movdqa 0x10(%rsi,%rax,2),%xmm0
  be: pcmpgtw %xmm2,%xmm1
  c2: pcmpgtw %xmm2,%xmm0
  c6: packsswb %xmm0,%xmm1
  ca: pand %xmm3,%xmm1
  ce: movdqa %xmm1,(%rdi,%rax,1)
  d3: add $0x10,%rax
  d7: cmp $0x400,%rax
  dd: jne b3 <g+0x16>

The same approach also applies to larger element sizes, including floating point, by adding more pack instructions.
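For reference, a minimal sketch (not part of the original report) of what the suggested pack-based loop could look like written directly with SSE2 intrinsics; the function name g matches the label in the disassembly above, but the exact register allocation and unrolling are the compiler's choice:

#include <emmintrin.h>

void g(char * a_, short * b_)
{
    char * restrict a = __builtin_assume_aligned(a_, 16);
    short * restrict b = __builtin_assume_aligned(b_, 16);
    const __m128i six  = _mm_set1_epi16(6);
    const __m128i ones = _mm_set1_epi8(1);
    for (int i = 0; i < 1024; i += 16) {
        /* load two vectors of 8 shorts each */
        __m128i lo = _mm_load_si128((const __m128i *)(b + i));
        __m128i hi = _mm_load_si128((const __m128i *)(b + i + 8));
        /* pcmpgtw: 0xffff where 6 < b[i], else 0 */
        lo = _mm_cmpgt_epi16(lo, six);
        hi = _mm_cmpgt_epi16(hi, six);
        /* packsswb: signed saturation narrows 0xffff -> 0xff, 0 -> 0,
           merging both halves into 16 bytes in one instruction */
        __m128i bytes = _mm_packs_epi16(lo, hi);
        /* single pand reduces 0xff to the boolean 1 the scalar code stores */
        bytes = _mm_and_si128(bytes, ones);
        _mm_store_si128((__m128i *)(a + i), bytes);
    }
}

Compared to the code gcc emits, this replaces the six-instruction punpck chain with one packsswb and halves the number of pand operations.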
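And a hypothetical sketch of the floating-point case mentioned at the end (not in the report; the function name h is illustrative): cmpps yields 32-bit all-ones/all-zero masks, which two further pack steps narrow to bytes.

#include <emmintrin.h>

void h(char * a_, float * b_)
{
    char * restrict a = __builtin_assume_aligned(a_, 16);
    float * restrict b = __builtin_assume_aligned(b_, 16);
    const __m128  six  = _mm_set1_ps(6.0f);
    const __m128i ones = _mm_set1_epi8(1);
    for (int i = 0; i < 1024; i += 16) {
        /* four vectors of 4 floats -> four vectors of 32-bit masks */
        __m128i m0 = _mm_castps_si128(_mm_cmpgt_ps(_mm_load_ps(b + i),      six));
        __m128i m1 = _mm_castps_si128(_mm_cmpgt_ps(_mm_load_ps(b + i + 4),  six));
        __m128i m2 = _mm_castps_si128(_mm_cmpgt_ps(_mm_load_ps(b + i + 8),  six));
        __m128i m3 = _mm_castps_si128(_mm_cmpgt_ps(_mm_load_ps(b + i + 12), six));
        /* packssdw then packsswb: 32 -> 16 -> 8 bit, saturation keeps -1 as -1 */
        __m128i w0 = _mm_packs_epi32(m0, m1);
        __m128i w1 = _mm_packs_epi32(m2, m3);
        __m128i bytes = _mm_packs_epi16(w0, w1);
        _mm_store_si128((__m128i *)(a + i), _mm_and_si128(bytes, ones));
    }
}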