https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 25 Apr 2019, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
>
> --- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
> Also, which is better: aligned loads/stores of a smaller size, or
> unaligned loads/stores of a bigger size?
>
> aligned load/store of smaller size:
>
>         movq    %rdx, (%rdi)
>         movq    -56(%rsp), %rdx
>         movq    %rdx, 8(%rdi)
>         movq    -48(%rsp), %rdx
>         movq    %rdx, 16(%rdi)
>         movq    -40(%rsp), %rdx
>         movq    %rdx, 24(%rdi)
>         vmovq   %xmm0, 32(%rax)
>         movq    -24(%rsp), %rdx
>         movq    %rdx, 40(%rdi)
>         movq    -16(%rsp), %rdx
>         movq    %rdx, 48(%rdi)
>         movq    -8(%rsp), %rdx
>         movq    %rdx, 56(%rdi)
>
> unaligned load/store of bigger size:
>
>         vmovups %xmm2, (%rdi)
>         vmovups %xmm3, 16(%rdi)
>         vmovups %xmm4, 32(%rdi)
>         vmovups %xmm5, 48(%rdi)

Bigger stores are almost always a win, while bigger loads have the
possibility of running into store-to-load forwarding stalls (which
bigger stores eventually mitigate). Depending on CPU tuning we would
also eventually end up splitting unaligned loads/stores with mov[lh]ps.