https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80561
--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 2 May 2017, glisse at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80561
>
> --- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
> Cool, that matches pretty much exactly the analysis I had posted on
> stackoverflow ;-)
>
> A separate issue from whether we can somehow propagate the alignment
> information is what we do without the alignment information (remove the
> attribute to be sure). GCC generates rather large code, with scalar and
> vector loops, to try and reach an aligned position for one of the buffers
> (the other one still requires potentially unaligned access) and perform at
> most 2 vector iterations. On the other hand, clang+llvm don't care about
> alignment and generate unaligned vector operations, totally unrolled
> (that's 2 vector iterations since there were 8 scalar iterations
> initially), for a grand total of 6 insns (with AVX). I have a hard time
> believing that gcc's complicated code is ever faster than clang's, whether
> the arrays are aligned or not. We can discuss that in a separate PR if
> this one should center on alignment.

The alignment peeling cost model is somewhat simplistic, but in this case,
where we end up with two aligned refs, we get

.L6:
        vmovupd (%rcx,%rax), %xmm0
        addl    $1, %r8d
        vinsertf128     $0x1, 16(%rcx,%rax), %ymm0, %ymm0
        vaddpd  (%r9,%rax), %ymm0, %ymm0
        vmovapd %ymm0, (%r9,%rax)
        addq    $32, %rax
        cmpl    %r10d, %r8d
        jb      .L6

vs.

.L4:
        vmovupd (%rsi,%rax), %xmm1
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %xmm0
        vinsertf128     $0x1, 16(%rsi,%rax), %ymm1, %ymm1
        vinsertf128     $0x1, 16(%rdi,%rax), %ymm0, %ymm0
        vaddpd  %ymm1, %ymm0, %ymm0
        vmovups %xmm0, (%rdi,%rax)
        vextractf128    $0x1, %ymm0, 16(%rdi,%rax)
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L4

with -mavx2 (and the generic tuning of splitting unaligned ymm loads/stores).
I'm sure a microbench would show that makes a difference. With -mtune=intel
it matters less, I guess -- but then the generic vectorizer cost model
somewhat reflects this with a vec_unalign_load_cost of 2 and a
vec_align_load_cost of 1 (surprisingly there's no vec_unaligned_store_cost,
but in the x86 backend it is the same as the unaligned load cost...). This
cost should probably depend on the vector size to reflect the splitting cost
for AVX-sized vectors.

That is, the backend (generic) cost model is currently too simplistic.
There's not a single tuning apart from -Os that costs unaligned loads the
same as aligned ones.
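
For concreteness, here is a minimal sketch of the kind of kernel under
discussion (the exact testcase from the PR and the stackoverflow post may
differ; the function name, the restrict qualifiers and the fixed trip count
of 8 are assumptions for illustration). With AVX the 8 scalar iterations
become 2 ymm iterations, and without alignment information the vectorizer
has to choose between peeling for alignment and plain unaligned accesses:

/* Illustrative sketch only, built with e.g. -O3 -mavx2.
   Eight double additions, so two 256-bit vector iterations.  */
void
add8 (double *restrict a, const double *restrict b)
{
  for (int i = 0; i < 8; i++)
    a[i] += b[i];
}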
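
And a purely hypothetical sketch of the vector-size-dependent cost suggested
above (this is not GCC's actual cost hook or tuning table, just an
illustration of the idea):

/* Hypothetical illustration only -- not GCC's cost interface.
   A flat "aligned = 1, unaligned = 2" pair cannot express that generic
   tuning splits an unaligned 32-byte access into two 16-byte halves plus
   a vinsertf128/vextractf128, so the penalty should grow with the size.  */
static int
unaligned_vec_access_cost (int vector_size_bytes)
{
  if (vector_size_bytes <= 16)
    return 2;   /* single unaligned xmm load/store */
  return 3;     /* two unaligned xmm halves plus an insert/extract */
}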