https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65968
Bug ID: 65968
Summary: Failure to remove casts causes poor code generation
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: law at redhat dot com
Target Milestone: ---

Failure to shorten the multiplies from int mode down to their more natural short mode causes poor code generation for this loop:

void f(short *a)
{
  a = __builtin_assume_aligned(a, 128);
  for (int i = 0; i < (1 << 22); ++i)
    {
#ifdef EASY
      a[i] *= a[i];
#else
      int x = a[i];
      x *= x;
      a[i] = x;
#endif
    }
}

With -DEASY, we get a nice little loop:

.L2:
        movdqa  (%rdi), %xmm0
        addq    $16, %rdi
        pmullw  %xmm0, %xmm0
        movaps  %xmm0, -16(%rdi)
        cmpq    %rdi, %rax
        jne     .L2

while without EASY, we get the uglier:

.L2:
        movdqa  (%rdi), %xmm0
        addq    $16, %rdi
        movdqa  %xmm0, %xmm2
        movdqa  %xmm0, %xmm1
        pmullw  %xmm0, %xmm2
        pmulhw  %xmm0, %xmm1
        movdqa  %xmm2, %xmm0
        punpckhwd  %xmm1, %xmm2
        punpcklwd  %xmm1, %xmm0
        movdqa  %xmm2, %xmm1
        movdqa  %xmm0, %xmm2
        punpcklwd  %xmm1, %xmm0
        punpckhwd  %xmm1, %xmm2
        movdqa  %xmm0, %xmm1
        punpcklwd  %xmm2, %xmm0
        punpckhwd  %xmm2, %xmm1
        punpcklwd  %xmm1, %xmm0
        movaps  %xmm0, -16(%rdi)
        cmpq    %rdi, %rax
        jne     .L2

The narrowing patterns currently in match.pd, and those proposed for match.pd at the time this BZ was filed, handle plus/minus but not multiply. When writing the current patterns I saw regressions when mult handling was included. Finding a way to avoid those regressions (I should have filed BZs for them) while still shortening for this case would be good.

Marc indicates that a pattern along these lines:

(simplify
 (vec_pack_trunc (widen_mult_lo @0 @1) (widen_mult_hi:c @0 @1))
 (mult @0 @1))

would help this specific case, but we may do better if we can do the type narrowing before vectorization.