https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90424
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so the "easier" way to allow aligned sub-vector inserts produces for
typedef unsigned char v16qi __attribute__((vector_size(16)));

v16qi load (const void *p)
{
  v16qi r;
  __builtin_memcpy (&r, p, 8);
  return r;
}
load (const void * p)
{
  v16qi r;
  long unsigned int _3;
  v16qi _5;
  vector(8) unsigned char _7;

  <bb 2> :
  _3 = MEM[(char * {ref-all})p_2(D)];
  _7 = VIEW_CONVERT_EXPR<vector(8) unsigned char>(_3);
  r_9 = BIT_INSERT_EXPR <r_8(D), _7, 0 (64 bits)>;
  _5 = r_9;
  return _5;
}
and unfortunately (as I feared) we get
load:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %rax
        pxor    %xmm1, %xmm1
        movaps  %xmm1, -24(%rsp)
        movq    %rax, -24(%rsp)
        movdqa  -24(%rsp), %xmm0
        ret
via expanding to
(insn 8 7 9 2 (set (subreg:V8QI (reg:V16QI 89 [ r ]) 0)
        (subreg:V8QI (reg:DI 88) 0)) "t.c":5:3 -1
     (nil))
RAed (register-allocated) from
(insn 8 7 13 2 (set (subreg:V8QI (reg:V16QI 89 [ r ]) 0)
        (mem:V8QI (reg:DI 90) [0 MEM[(char * {ref-all})p_2(D)]+0 S8 A8])) "t.c":5:3 1088 {*movv8qi_internal}
     (expr_list:REG_DEAD (reg:DI 90)
        (nil)))
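For contrast, here is a minimal sketch (not part of the original report; the function name load_low8 is made up) of the single-instruction code one would hope for, using the SSE2 intrinsic that maps to a plain 8-byte movq load with the upper half zeroed:

#include <emmintrin.h>

/* Hypothetical hand-written equivalent: _mm_loadl_epi64 loads 8 bytes
   from memory and zero-fills the upper 8 bytes of the vector.  */
__m128i load_low8 (const void *p)
{
  return _mm_loadl_epi64 ((const __m128i *) p);
}

With optimization this should come out as just movq (%rdi), %xmm0 plus ret; the pxor/movaps/movdqa round-trip through the stack slot at -24(%rsp) above is what makes the expansion disappointing.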
It's still IMHO the most reasonable IL given the vector constructors
we allow.
Inserting 4 bytes is even worse though (see the sketch below); inserting the upper 8 bytes is like the above.
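For illustration (an assumed variant, not from the report), the 4-byte case corresponds to something like:

typedef unsigned char v16qi __attribute__((vector_size(16)));

/* Hypothetical 4-byte variant of the testcase above: only the low
   4 bytes of the 16-byte vector come from memory.  */
v16qi load4 (const void *p)
{
  v16qi r;
  __builtin_memcpy (&r, p, 4);
  return r;
}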
Code generation isn't worse than with the unpatched compiler, and the GIMPLE is clearly better (allowing for follow-up optimizations).