https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65968

            Bug ID: 65968
           Summary: Failure to remove casts, cause poor code generation
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: law at redhat dot com
  Target Milestone: ---

Failure to shorten the multiplies from int mode down to their more native short
modes causes poor code generation for this loop:


void f(short*a) {
  a = __builtin_assume_aligned(a,128);
  for (int i = 0; i < (1<<22); ++i) {
#ifdef EASY
    a[i] *= a[i];
#else
    int x = a[i];
    x *= x;
    a[i] = x;
#endif
  }
}

With -DEASY, a nice little loop:
.L2:
    movdqa    (%rdi), %xmm0
    addq    $16, %rdi
    pmullw    %xmm0, %xmm0
    movaps    %xmm0, -16(%rdi)
    cmpq    %rdi, %rax
    jne    .L2

while without EASY, we get the uglier:
.L2:
    movdqa    (%rdi), %xmm0
    addq    $16, %rdi
    movdqa    %xmm0, %xmm2
    movdqa    %xmm0, %xmm1
    pmullw    %xmm0, %xmm2
    pmulhw    %xmm0, %xmm1
    movdqa    %xmm2, %xmm0
    punpckhwd    %xmm1, %xmm2
    punpcklwd    %xmm1, %xmm0
    movdqa    %xmm2, %xmm1
    movdqa    %xmm0, %xmm2
    punpcklwd    %xmm1, %xmm0
    punpckhwd    %xmm1, %xmm2
    movdqa    %xmm0, %xmm1
    punpcklwd    %xmm2, %xmm0
    punpckhwd    %xmm2, %xmm1
    punpcklwd    %xmm1, %xmm0
    movaps    %xmm0, -16(%rdi)
    cmpq    %rdi, %rax
    jne    .L2

The narrowing patterns currently in match.pd, and those proposed for match.pd
at the time of submitting this BZ, handle plus/minus but not multiply.  When
writing the current patterns I saw regressions when mult handling was included
(I should have filed BZs for them).  Finding a way to avoid those regressions
while still shortening for this case would be good.
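
As a sanity check on why shortening is legitimate here, the following is a
minimal sketch (not GCC code) comparing the widened multiply with a 16-bit
wrapping multiply over every short value.  The helper names square_wide,
square_narrow and count_mismatches are made up for illustration.  It assumes a
32-bit int and GCC's modulo behaviour for out-of-range conversions to a narrower
signed type (implementation-defined in ISO C).

```c
#include <stdint.h>

/* Non-EASY form from the test case: widen to int, multiply, truncate on store.
   Assumes int is 32 bits, so the square of any short value cannot overflow. */
static int16_t square_wide(int16_t v) {
    int x = v;
    x *= x;
    return (int16_t)x;   /* implementation-defined; GCC reduces modulo 2^16 */
}

/* Narrowed form: keep only the low 16 bits of the product, which is what a
   16-bit multiply such as pmullw produces per lane.  The arithmetic is done
   on unsigned values so the C code itself never overflows a signed type. */
static int16_t square_narrow(int16_t v) {
    uint16_t u = (uint16_t)v;
    return (int16_t)(uint16_t)((uint32_t)u * u);
}

/* Exhaustively compare the two forms; returns the number of disagreements. */
static long count_mismatches(void) {
    long bad = 0;
    for (int v = INT16_MIN; v <= INT16_MAX; ++v)
        if (square_wide((int16_t)v) != square_narrow((int16_t)v))
            ++bad;
    return bad;
}
```

The low 16 bits of a product depend only on the low 16 bits of the operands,
so the two forms should agree for every input; the regressions mentioned above
concern code quality in other cases, not correctness of this identity.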

Marc indicates that a pattern along these lines:

(simplify
 (vec_pack_trunc (widen_mult_lo @0 @1) (widen_mult_hi:c @0 @1))
 (mult @0 @1))

Would help this specific case, but we may do better if we can do the type
narrowing before vectorization.
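
To illustrate what that pattern claims, here is a rough scalar model in C of
the vector operations involved.  It is only a sketch: the helper names
(widen_mult_half, pack_trunc, narrow_mult, pattern_holds) are invented for this
example, the vector is assumed to have 8 short lanes, and the lo/hi lane
ordering is simplified (in GCC it depends on target endianness).

```c
#include <stdint.h>
#include <string.h>

enum { LANES = 8 };   /* assumed: 8 x 16-bit lanes, as in a 128-bit vector */

/* Model of widen_mult_lo / widen_mult_hi: multiply one half of the input
   lanes, producing results of twice the element width. */
static void widen_mult_half(const int16_t *a, const int16_t *b,
                            int first, int32_t *out) {
    for (int i = 0; i < LANES / 2; ++i)
        out[i] = (int32_t)a[first + i] * (int32_t)b[first + i];
}

/* Model of vec_pack_trunc: truncate each wide element to 16 bits and
   concatenate the two half-vectors. */
static void pack_trunc(const int32_t *lo, const int32_t *hi, int16_t *out) {
    for (int i = 0; i < LANES / 2; ++i) {
        out[i]             = (int16_t)(uint16_t)(uint32_t)lo[i];
        out[i + LANES / 2] = (int16_t)(uint16_t)(uint32_t)hi[i];
    }
}

/* Model of the simplified form: a lane-wise 16-bit wrapping multiply. */
static void narrow_mult(const int16_t *a, const int16_t *b, int16_t *out) {
    for (int i = 0; i < LANES; ++i)
        out[i] = (int16_t)(uint16_t)((uint32_t)(uint16_t)a[i] * (uint16_t)b[i]);
}

/* Returns 1 if pack_trunc(widen lo, widen hi) equals the narrow multiply. */
static int pattern_holds(const int16_t *a, const int16_t *b) {
    int32_t lo[LANES / 2], hi[LANES / 2];
    int16_t packed[LANES], direct[LANES];

    widen_mult_half(a, b, 0, lo);
    widen_mult_half(a, b, LANES / 2, hi);
    pack_trunc(lo, hi, packed);
    narrow_mult(a, b, direct);
    return memcmp(packed, direct, sizeof packed) == 0;
}
```

In this model the widening and packing steps cancel out, which is the
equivalence the simplify rule relies on.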
