http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60575
Bug ID: 60575
Summary: inefficient vectorization of compare into bytes on amd64
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jtaylor.debian at googlemail dot com

This code, comparing shorts into chars:

void __attribute__((optimize("O3")))
f(char * a_, short * b_)
{
    char * restrict a = __builtin_assume_aligned(a_, 16);
    short * restrict b = __builtin_assume_aligned(b_, 16);
    for (int i = 0; i < 1024; i++) {
        a[i] = 6 < b[i];
    }
}

vectorizes with gcc 4.8.2 (gcc file.c -c -std=c99) to:

  22: movdqa (%rsi,%rax,2),%xmm0
  27: movdqa 0x10(%rsi,%rax,2),%xmm1
  2d: pcmpgtw %xmm4,%xmm0
  31: pcmpgtw %xmm4,%xmm1
  35: pand %xmm3,%xmm0
  39: pand %xmm3,%xmm1
  3d: movdqa %xmm0,%xmm2
  41: punpcklbw %xmm1,%xmm0
  45: punpckhbw %xmm1,%xmm2
  49: movdqa %xmm0,%xmm1
  4d: punpcklbw %xmm2,%xmm0
  51: punpckhbw %xmm2,%xmm1
  55: movdqa %xmm0,%xmm2
  59: punpcklbw %xmm1,%xmm0
  5d: punpckhbw %xmm1,%xmm2
  61: punpcklbw %xmm2,%xmm0
  65: movdqa %xmm0,(%rdi,%rax,1)
  6a: add $0x10,%rax
  6e: cmp $0x400,%rax
  74: jne 22 <f+0x22>

This is relatively inefficient compared to using pack instructions, which would look roughly like this (unrolled twice):

  b3: movdqa (%rsi,%rax,2),%xmm1
  b8: movdqa 0x10(%rsi,%rax,2),%xmm0
  be: pcmpgtw %xmm2,%xmm1
  c2: pcmpgtw %xmm2,%xmm0
  c6: packsswb %xmm0,%xmm1
  ca: pand %xmm3,%xmm1
  ce: movdqa %xmm1,(%rdi,%rax,1)
  d3: add $0x10,%rax
  d7: cmp $0x400,%rax
  dd: jne b3 <g+0x16>

The same approach also applies to larger element sizes, including floating point, by adding more pack instructions.
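For reference, a minimal sketch (not part of the original report) of what the suggested pack-based loop could look like written directly with SSE2 intrinsics; the function name g matches the label in the disassembly above, but the exact register allocation and unrolling are the compiler's choice:

#include <emmintrin.h>

void g(char * a_, short * b_)
{
    char * restrict a = __builtin_assume_aligned(a_, 16);
    short * restrict b = __builtin_assume_aligned(b_, 16);
    const __m128i six  = _mm_set1_epi16(6);
    const __m128i ones = _mm_set1_epi8(1);
    for (int i = 0; i < 1024; i += 16) {
        /* load two vectors of 8 shorts each */
        __m128i lo = _mm_load_si128((const __m128i *)(b + i));
        __m128i hi = _mm_load_si128((const __m128i *)(b + i + 8));
        /* pcmpgtw: 0xffff where 6 < b[i], else 0 */
        lo = _mm_cmpgt_epi16(lo, six);
        hi = _mm_cmpgt_epi16(hi, six);
        /* packsswb: signed saturation narrows 0xffff -> 0xff, 0 -> 0,
           merging both halves into 16 bytes in one instruction */
        __m128i bytes = _mm_packs_epi16(lo, hi);
        /* single pand reduces 0xff to the boolean 1 the scalar code stores */
        bytes = _mm_and_si128(bytes, ones);
        _mm_store_si128((__m128i *)(a + i), bytes);
    }
}

Compared to the code gcc emits, this replaces the six-instruction punpck chain with one packsswb and halves the number of pand operations.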
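And a hypothetical sketch of the floating-point case mentioned at the end (not in the report; the function name h is illustrative): cmpps yields 32-bit all-ones/all-zero masks, which two further pack steps narrow to bytes.

#include <emmintrin.h>

void h(char * a_, float * b_)
{
    char * restrict a = __builtin_assume_aligned(a_, 16);
    float * restrict b = __builtin_assume_aligned(b_, 16);
    const __m128  six  = _mm_set1_ps(6.0f);
    const __m128i ones = _mm_set1_epi8(1);
    for (int i = 0; i < 1024; i += 16) {
        /* four vectors of 4 floats -> four vectors of 32-bit masks */
        __m128i m0 = _mm_castps_si128(_mm_cmpgt_ps(_mm_load_ps(b + i),      six));
        __m128i m1 = _mm_castps_si128(_mm_cmpgt_ps(_mm_load_ps(b + i + 4),  six));
        __m128i m2 = _mm_castps_si128(_mm_cmpgt_ps(_mm_load_ps(b + i + 8),  six));
        __m128i m3 = _mm_castps_si128(_mm_cmpgt_ps(_mm_load_ps(b + i + 12), six));
        /* packssdw then packsswb: 32 -> 16 -> 8 bit, saturation keeps -1 as -1 */
        __m128i w0 = _mm_packs_epi32(m0, m1);
        __m128i w1 = _mm_packs_epi32(m2, m3);
        __m128i bytes = _mm_packs_epi16(w0, w1);
        _mm_store_si128((__m128i *)(a + i), _mm_and_si128(bytes, ones));
    }
}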