http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> 2011-12-13 09:20:54 UTC ---
FWIW,

    uint8x8x4_t x;
    uint8x8x2_t y;

    x = vld4_dup_u8(src);
    y.val[0] = x.val[1];
    y.val[1] = x.val[2];
    vst2_lane_u8(dst, y, 0);

does give the expected output.  I.e. the remaining inefficiency from
comment #1 is in the uninitialised parts of y.

Richard
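
For readers without the NEON intrinsics to hand, here is a rough scalar sketch of what the sequence in this comment computes. It is only a model: the `model_*` types and functions and `copy_middle_pair` are hypothetical names made up for illustration, not anything from arm_neon.h or from GCC, and it says nothing about the code GCC actually generates. It assumes the usual semantics of the two intrinsics: vld4_dup_u8 reads four consecutive bytes and replicates byte i across every lane of val[i], and vst2_lane_u8 stores the selected lane of val[0] and val[1] to two consecutive bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar stand-ins for the NEON vector types (assumption:
   these are NOT the real arm_neon.h definitions). */
typedef struct { uint8_t lane[8]; } model_u8x8;
typedef struct { model_u8x8 val[4]; } model_u8x8x4;
typedef struct { model_u8x8 val[2]; } model_u8x8x2;

/* Model of vld4_dup_u8: read src[0..3] and replicate src[i] across all
   eight lanes of val[i]. */
static model_u8x8x4 model_vld4_dup_u8(const uint8_t *src)
{
    model_u8x8x4 x;
    assert(src != NULL);
    for (int i = 0; i < 4; i++)
        for (int l = 0; l < 8; l++)
            x.val[i].lane[l] = src[i];
    return x;
}

/* Model of vst2_lane_u8: store the given lane of val[0] and val[1] to
   dst[0] and dst[1]. */
static void model_vst2_lane_u8(uint8_t *dst, model_u8x8x2 y, int lane)
{
    assert(dst != NULL && lane >= 0 && lane < 8);
    dst[0] = y.val[0].lane[lane];
    dst[1] = y.val[1].lane[lane];
}

/* The sequence from the comment, expressed with the model functions.
   Both halves of y are fully assigned before the store. */
static void copy_middle_pair(const uint8_t *src, uint8_t *dst)
{
    model_u8x8x4 x = model_vld4_dup_u8(src);
    model_u8x8x2 y;

    y.val[0] = x.val[1];
    y.val[1] = x.val[2];
    model_vst2_lane_u8(dst, y, 0);
}
```

Under this model the net effect is dst[0] = src[1] and dst[1] = src[2]; only lane 0 of each half of y ever reaches memory.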