https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  We fail to elide the 'pixel' temporary, that is, to express

  memcpy (&pixel, src_33, 6);
  _1 = pixel.b;
  _2 = pixel.g;
  _3 = pixel.r;

in terms of loads from src.  The backend intrinsic, which expands to a target
builtin, of course does not help things either.

For the above we'd need SRA-like analysis; while VN could rematerialize the
loads from src, it lacks the global costing that would tell it that all uses
of pixel, and thus the memcpy, go away.

We can fold the memcpy to

  __MEM <unsigned char[6], 16> ((char * {ref-all})&pixel)
    = __MEM <unsigned char[6]> ((char * {ref-all})src_13);

with the following:

diff --git a/gcc/gimple-fold.c b/gcc/gimple-fold.c
index 281839d4a73..750a5d884a7 100644
--- a/gcc/gimple-fold.c
+++ b/gcc/gimple-fold.c
@@ -1215,9 +1215,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
       if (TREE_CODE (dest) == ADDR_EXPR
           && var_decl_component_p (TREE_OPERAND (dest, 0))
           && tree_int_cst_equal (TYPE_SIZE_UNIT (desttype), len)
-          && dest_align >= TYPE_ALIGN (desttype)
-          && (is_gimple_reg_type (desttype)
-              || src_align >= TYPE_ALIGN (desttype)))
+          && dest_align >= TYPE_ALIGN (desttype))
        destvar = fold_build2 (MEM_REF, desttype, dest, off0);
       else if (TREE_CODE (src) == ADDR_EXPR
                && var_decl_component_p (TREE_OPERAND (src, 0))

and then end up with

  pixel$r_3 = MEM <const uint16_t> [(char * {ref-all})src_34];
  pixel$g_2 = MEM <const uint16_t> [(char * {ref-all})src_34 + 2B];
  pixel$b_1 = MEM <const uint16_t> [(char * {ref-all})src_34 + 4B];
  val_2.0_19 = (short int) pixel$b_1;
  val_1.1_20 = (short int) pixel$g_2;
  val_0.2_21 = (short int) pixel$r_3;
  _22 = {val_0.2_21, val_1.1_20, val_2.0_19, 0, 0, 0, 0, 0};
  _23 = __builtin_ia32_vcvtph2ps (_22);

but that doesn't help in the end.  It does help vectorizing the loop when you
avoid the intrinsic by doing

   for (unsigned x = 0; x < width; x += 1) {
      const struct util_format_r16g16b16_float pixel;
      memcpy(&pixel, src, sizeof pixel);
      struct float3 r;// = _mesa_half3_to_float3(pixel.r, pixel.g, pixel.b);
      r.f1 = pixel.r;
      r.f2 = pixel.g;
      r.f3 = pixel.b;
      dst[0] = r.f1; /* r */
      dst[1] = r.f2; /* g */
      dst[2] = r.f3; /* b */
      dst[3] = 1; /* a */
      src += 6;
      dst += 4;
   }

then we vectorize it as

  vect_pixel_r_25.30_283 = MEM <const vector(8) short unsigned int> [(char * {ref-all})vectp_src.29_278];
  vect_pixel_r_25.31_285 = MEM <const vector(8) short unsigned int> [(char * {ref-all})vectp_src.29_278 + 16B];
  vect_pixel_r_25.32_287 = MEM <const vector(8) short unsigned int> [(char * {ref-all})vectp_src.29_278 + 32B];
  vect__8.33_288 = [vec_unpack_float_lo_expr] vect_pixel_r_25.30_283;
  vect__8.33_289 = [vec_unpack_float_hi_expr] vect_pixel_r_25.30_283;
  vect__8.33_290 = [vec_unpack_float_lo_expr] vect_pixel_r_25.31_285;
  vect__8.33_291 = [vec_unpack_float_hi_expr] vect_pixel_r_25.31_285;
  vect__8.33_292 = [vec_unpack_float_lo_expr] vect_pixel_r_25.32_287;
  vect__8.33_293 = [vec_unpack_float_hi_expr] vect_pixel_r_25.32_287;

but then we somehow mess up the analysis of the stores, going for hybrid SLP
(we split the store group).  We could just leave the dst[3] stores
unvectorized ... but then we somehow decide to emit

  vpmovzxwd (%rdx), %ymm11
  ...
  vcvtdq2ps %ymm11, %ymm11

and thus use a different conversion path (not sure whether it is worse in the
end, but ...).
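
For reference, here is a self-contained sketch of the intrinsic-free variant
above that can be compiled stand-alone.  The struct layouts, the pointer types
and the function name unpack_row are assumptions reconstructed from the dumps
in this comment (the real definitions live in Mesa's u_format code), so they
may not match the original testcase exactly:

  /* Minimal sketch, not the original Mesa testcase: the struct layouts,
     the function signature and the pointer types are guessed from the
     GIMPLE dumps above.  */
  #include <stdint.h>
  #include <string.h>

  struct util_format_r16g16b16_float { uint16_t r, g, b; };  /* assumed layout */
  struct float3 { float f1, f2, f3; };                       /* assumed layout */

  void
  unpack_row (float *dst, const uint8_t *src, unsigned width)
  {
    for (unsigned x = 0; x < width; x += 1) {
      /* 'const' from the quoted snippet dropped so the memcpy destination
         is not const-qualified.  */
      struct util_format_r16g16b16_float pixel;
      memcpy (&pixel, src, sizeof pixel);   /* should fold to three uint16_t loads */
      struct float3 r;
      r.f1 = pixel.r;   /* plain integer -> float conversion, no half-float intrinsic */
      r.f2 = pixel.g;
      r.f3 = pixel.b;
      dst[0] = r.f1; /* r */
      dst[1] = r.f2; /* g */
      dst[2] = r.f3; /* b */
      dst[3] = 1;    /* a */
      src += 6;
      dst += 4;
    }
  }

Compiling this at -O3 with AVX2 enabled should show the vectorized GIMPLE
quoted above and the store-group split, modulo compiler version differences.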