https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 25 Apr 2019, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
> 
> --- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
> Also, which is better: aligned loads/stores of a smaller size vs.
> unaligned loads/stores of a bigger size?
> 
> aligned load/store of smaller size:
> 
>         movq    %rdx, (%rdi)
>         movq    -56(%rsp), %rdx
>         movq    %rdx, 8(%rdi)
>         movq    -48(%rsp), %rdx
>         movq    %rdx, 16(%rdi)
>         movq    -40(%rsp), %rdx
>         movq    %rdx, 24(%rdi)
>         vmovq   %xmm0, 32(%rax)
>         movq    -24(%rsp), %rdx
>         movq    %rdx, 40(%rdi)
>         movq    -16(%rsp), %rdx
>         movq    %rdx, 48(%rdi)
>         movq    -8(%rsp), %rdx
>         movq    %rdx, 56(%rdi)
> 
> unaligned load/store of bigger size:
> 
>         vmovups %xmm2, (%rdi)
>         vmovups %xmm3, 16(%rdi)
>         vmovups %xmm4, 32(%rdi)
>         vmovups %xmm5, 48(%rdi)

Bigger stores are almost always a win, while bigger loads risk
running into store-to-load forwarding stalls (which bigger stores
eventually mitigate).  Depending on CPU tuning we'd also eventually
end up splitting unaligned loads/stores with mov[lh]ps.
