https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80561

--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 2 May 2017, glisse at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80561
> 
> --- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
> Cool, that matches pretty much exactly the analysis I had posted on
> stackoverflow ;-)
> 
> A separate issue from whether we can somehow propagate the alignment
> information is what we do without the alignment information (remove the
> attribute to be sure). GCC generates rather large code, with scalar and
> vector loops, to try to reach an aligned position for one of the buffers
> (the other one still requires potentially unaligned access) and perform
> at most 2 vector iterations. On the other hand, clang+llvm don't care
> about alignment and generate unaligned vector operations, fully unrolled
> (that's 2 vector iterations since there were 8 scalar iterations
> initially), for a grand total of 6 insns (with AVX). I have a hard time
> believing that GCC's complicated code is ever faster than clang's,
> whether the arrays are aligned or not. We can discuss that in a separate
> PR if this one should center on alignment.
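
For reference, a minimal reproducer along the lines described above would
look roughly like the following (a hypothetical sketch reconstructed from
the description -- two buffers of doubles, 8 scalar iterations, an
alignment attribute that can be dropped to test the unaligned case -- not
necessarily the exact testcase from the PR):

struct buf { double d[8] __attribute__((aligned(32))); };

/* Add one buffer of 8 doubles into another; with AVX this is the kind
   of loop that ends up as 2 vector iterations.  */
void
add (struct buf *restrict a, const struct buf *restrict b)
{
  for (int i = 0; i < 8; i++)
    a->d[i] += b->d[i];
}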

The alignment-peeling cost model is somewhat simplistic, but in this case,
where we end up with two aligned refs, we get

.L6:
        vmovupd (%rcx,%rax), %xmm0
        addl    $1, %r8d
        vinsertf128     $0x1, 16(%rcx,%rax), %ymm0, %ymm0
        vaddpd  (%r9,%rax), %ymm0, %ymm0
        vmovapd %ymm0, (%r9,%rax)
        addq    $32, %rax
        cmpl    %r10d, %r8d
        jb      .L6

vs.

.L4:
        vmovupd (%rsi,%rax), %xmm1
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %xmm0
        vinsertf128     $0x1, 16(%rsi,%rax), %ymm1, %ymm1
        vinsertf128     $0x1, 16(%rdi,%rax), %ymm0, %ymm0
        vaddpd  %ymm1, %ymm0, %ymm0
        vmovups %xmm0, (%rdi,%rax)
        vextractf128    $0x1, %ymm0, 16(%rdi,%rax)
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L4

with -mavx2 (and the generic tuning of splitting unaligned ymm
loads/stores).  I'm sure a microbench would show that this makes
a difference.  With -mtune=intel less so, I guess -- but then
the generic vectorizer cost model somewhat reflects this with
a vec_unalign_load_cost of 2 and a vec_align_load_cost of 1
(surprisingly there's no vec_unalign_store_cost, but the unaligned
store is costed the same as the unaligned load in the x86 backend...).
This should probably depend on the vector size to reflect
the splitting cost for AVX-sized vectors.
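
A microbenchmark along these lines would be enough to check that (a rough
sketch, not a testcase from the PR; the names are made up, and the
misalignment is forced by offsetting into over-aligned storage):

#include <stdio.h>
#include <time.h>

#define N 8
#define REPS 100000000L

/* Keep the kernel out of line so the repeated calls are not folded away.  */
__attribute__((noinline)) void
add_loop (double *restrict a, const double *restrict b)
{
  for (int i = 0; i < N; i++)
    a[i] += b[i];
}

int
main (void)
{
  /* Over-allocate so we can pick an aligned or a misaligned start.  */
  static double sa[N + 4] __attribute__((aligned(32)));
  static double sb[N + 4] __attribute__((aligned(32)));
  double *a = sa + 1, *b = sb + 1;   /* +1 double => 8-byte misaligned.  */

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (long r = 0; r < REPS; r++)
    add_loop (a, b);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  printf ("%.3f s\n",
          (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
  return 0;
}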

That is, the backend (generic) cost model is currently too simplistic.
There's not a single tuning apart from -Os that costs unaligned loads
the same as aligned ones.
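
Something along these lines would already help (an illustrative sketch
only, with made-up names, not the actual i386 backend hook or cost-table
fields):

#include <stdbool.h>

/* Sketch: scale the unaligned vector access cost with the vector size
   when the tuning splits 32-byte unaligned loads/stores into two
   16-byte halves, instead of using one flat unaligned cost.  */
int
unaligned_vec_access_cost (int vector_size_bytes, int unalign_cost,
                           bool split_unaligned_32byte)
{
  if (split_unaligned_32byte && vector_size_bytes > 16)
    /* Two 16-byte halves plus the insert/extract glue.  */
    return 2 * unalign_cost + 1;
  return unalign_cost;
}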
