https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  We fail to elide the 'pixel' temporary, that is, express

  memcpy (&pixel, src_33, 6);
  _1 = pixel.b;
  _2 = pixel.g;
  _3 = pixel.r;

in terms of loads from src.  On top of that, the backend intrinsic expanding
to a target builtin of course does not help things.
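
For reference, the _mesa_half3_to_float3 path (commented out in the loop
quoted further below) presumably boils down to something like the following
with F16C enabled; this is only a sketch, the real code is in the PR's
testcase, but _mm_cvtph_ps is the intrinsic that expands to the
__builtin_ia32_vcvtph2ps seen in the dumps:

  #include <stdint.h>
  #include <immintrin.h>

  /* Compile with -mf16c; the names here are made up for illustration.  */
  static inline void
  half3_to_float3 (uint16_t r, uint16_t g, uint16_t b, float out[3])
  {
    __m128i h = _mm_setr_epi16 (r, g, b, 0, 0, 0, 0, 0);  /* {r, g, b, 0, ...} */
    __m128 f = _mm_cvtph_ps (h);                           /* half -> float */
    float tmp[4];
    _mm_storeu_ps (tmp, f);
    out[0] = tmp[0];
    out[1] = tmp[1];
    out[2] = tmp[2];
  }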

For expressing those accesses as loads from src we'd need SRA-like analysis;
while VN could rematerialize the loads from src, it lacks the global costing
that would tell it that all uses of pixel, and thus the memcpy itself, go
away.
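
Concretely, the form we want is plain loads from src, which at the source
level is roughly this (hand-written sketch, assuming src is a const uint8_t *
and <string.h> is in scope; the 2-byte offsets match the dump further below):

  uint16_t r, g, b;
  memcpy (&r, src, 2);      /* pixel.r */
  memcpy (&g, src + 2, 2);  /* pixel.g */
  memcpy (&b, src + 4, 2);  /* pixel.b */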

We can fold the memcpy to

  __MEM <unsigned char[6], 16> ((char * {ref-all})&pixel) = __MEM <unsigned char[6]> ((char * {ref-all})src_13);

with the following:

diff --git a/gcc/gimple-fold.c b/gcc/gimple-fold.c
index 281839d4a73..750a5d884a7 100644
--- a/gcc/gimple-fold.c
+++ b/gcc/gimple-fold.c
@@ -1215,9 +1215,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
       if (TREE_CODE (dest) == ADDR_EXPR
          && var_decl_component_p (TREE_OPERAND (dest, 0))
          && tree_int_cst_equal (TYPE_SIZE_UNIT (desttype), len)
-         && dest_align >= TYPE_ALIGN (desttype)
-         && (is_gimple_reg_type (desttype)
-             || src_align >= TYPE_ALIGN (desttype)))
+         && dest_align >= TYPE_ALIGN (desttype))
        destvar = fold_build2 (MEM_REF, desttype, dest, off0);
       else if (TREE_CODE (src) == ADDR_EXPR
               && var_decl_component_p (TREE_OPERAND (src, 0))

and then end up with

  pixel$r_3 = MEM <const uint16_t> [(char * {ref-all})src_34];
  pixel$g_2 = MEM <const uint16_t> [(char * {ref-all})src_34 + 2B];
  pixel$b_1 = MEM <const uint16_t> [(char * {ref-all})src_34 + 4B];
  val_2.0_19 = (short int) pixel$b_1;
  val_1.1_20 = (short int) pixel$g_2;
  val_0.2_21 = (short int) pixel$r_3;
  _22 = {val_0.2_21, val_1.1_20, val_2.0_19, 0, 0, 0, 0, 0};
  _23 = __builtin_ia32_vcvtph2ps (_22);

but that doesn't help in the end.  It does, however, help vectorize the loop
when you avoid the intrinsic and instead write

   for (unsigned x = 0; x < width; x += 1) {
        const struct util_format_r16g16b16_float pixel;
        memcpy(&pixel, src, sizeof pixel);

        struct float3 r;// = _mesa_half3_to_float3(pixel.r, pixel.g, pixel.b);
        r.f1 = pixel.r;
        r.f2 = pixel.g;
        r.f3 = pixel.b;
        dst[0] = r.f1; /* r */
        dst[1] = r.f2; /* g */
        dst[2] = r.f3; /* b */
        dst[3] = 1; /* a */

        src += 6;
        dst += 4;
   }
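
(The declarations this loop assumes are roughly the following; the field
names are taken from the accesses above, the exact definitions are in the
PR's preprocessed source.)

  struct util_format_r16g16b16_float {
    uint16_t r, g, b;   /* packed IEEE half-float components */
  };

  struct float3 {
    float f1, f2, f3;
  };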

then we vectorize it as

  vect_pixel_r_25.30_283 = MEM <const vector(8) short unsigned int> [(char * {ref-all})vectp_src.29_278];
  vect_pixel_r_25.31_285 = MEM <const vector(8) short unsigned int> [(char * {ref-all})vectp_src.29_278 + 16B];
  vect_pixel_r_25.32_287 = MEM <const vector(8) short unsigned int> [(char * {ref-all})vectp_src.29_278 + 32B];
  vect__8.33_288 = [vec_unpack_float_lo_expr] vect_pixel_r_25.30_283;
  vect__8.33_289 = [vec_unpack_float_hi_expr] vect_pixel_r_25.30_283;
  vect__8.33_290 = [vec_unpack_float_lo_expr] vect_pixel_r_25.31_285;
  vect__8.33_291 = [vec_unpack_float_hi_expr] vect_pixel_r_25.31_285;
  vect__8.33_292 = [vec_unpack_float_lo_expr] vect_pixel_r_25.32_287;
  vect__8.33_293 = [vec_unpack_float_hi_expr] vect_pixel_r_25.32_287;

but then we somehow mess up the analysis of the stores and go for hybrid SLP
(we split the store group).  We could just leave the dst[3] stores
unvectorized ... but then we decide to emit

        vpmovzxwd       (%rdx), %ymm11
...
        vcvtdq2ps       %ymm11, %ymm11

and thus use a different conversion path (not sure if it is worse in the
end, but ...).
