https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #36 from Richard Biener <rguenth at gcc dot gnu.org> ---
As additional observation for the c-ray case we end up with

  <bb 2> [local count: 1073741824]:
  vect_ray_orig_x_87.270_173 = MEM <vector(2) double> [(double *)&ray];
  _170 = BIT_FIELD_REF <vect_ray_orig_x_87.270_173, 64, 64>;
  _171 = BIT_FIELD_REF <vect_ray_orig_x_87.270_173, 64, 0>;
  # DEBUG D#93 => ray.orig.x
  # DEBUG ray$orig$x => D#93
  # DEBUG D#92 => ray.orig.y
  # DEBUG ray$orig$y => D#92
  ray$orig$z_89 = ray.orig.z;
  # DEBUG ray$orig$z => ray$orig$z_89
  vect_ray_dir_x_90.266_178 = MEM <vector(2) double> [(double *)&ray + 24B];
  _175 = BIT_FIELD_REF <vect_ray_dir_x_90.266_178, 64, 64>;
  _176 = BIT_FIELD_REF <vect_ray_dir_x_90.266_178, 64, 0>;
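
For context, the source pattern behind this is roughly the following - a
reconstruction from the dump, not the literal c-ray code, and foo is a
stand-in for the actual function:

  struct vec3 { double x, y, z; };
  struct ray { struct vec3 orig, dir; };

  double foo (struct ray ray)
  {
    /* ray is passed on the stack; the caller stores its six doubles
       individually and each vector(2) double load above then reads
       across two of those 8-byte stores.  */
    return ray.orig.x * ray.dir.x + ray.orig.y * ray.dir.y
           + ray.orig.z * ray.dir.z;
  }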

so we load as a vector but still need both lanes for scalar code pieces we
couldn't vectorize (live lanes).  It's somewhat difficult to reverse the
vectorization decision at that point - we need the final set of stmts
we vectorize to compute live lanes, and we need to know which operands
are vectorized to tell whether we can vectorize a stmt.  But at least
for loads we could eventually use scalar loads and a CTOR "late".
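
Concretely, a "late" rewrite of the first load above could look like
(hand-written GIMPLE sketch; lane order taken from the BIT_FIELD_REFs):

  _171 = ray.orig.x;
  _170 = ray.orig.y;
  vect_ray_orig_x_87.270_173 = {_171, _170};

Each scalar load then forwards from a single 8-byte store on the caller
side, so STLF succeeds, and the vector is still available for the
vectorized uses.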

There's also code in GIMPLE forwprop that can decompose vector loads
feeding BIT_FIELD_REFs, but it only does that if there's no other use of
the vector (and in this case of course there is - a single other use for
the first load and two for the second).
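
For reference, without the extra uses forwprop would rewrite roughly

  _173 = MEM <vector(2) double> [(double *)&ray];
  _171 = BIT_FIELD_REF <_173, 64, 0>;
  _170 = BIT_FIELD_REF <_173, 64, 64>;

into

  _171 = MEM <double> [(double *)&ray];
  _170 = MEM <double> [(double *)&ray + 8B];

(a sketch of the transform, not actual dump output).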

There is not much value in the vectorization we do in this function
(with the STLF issue fixed manually the speed is no better than with the
scalar code).  We cost

ray.dir.x 1 times scalar_load costs 12 in body
ray.dir.y 1 times scalar_load costs 12 in body

vs.

ray.dir.x 1 times unaligned_load (misalign -1) costs 12 in body
ray.dir.x 1 times vec_to_scalar costs 4 in epilogue
ray.dir.y 1 times vec_to_scalar costs 4 in epilogue

which is probably OK.  With SSE it's two loads vs. one load + move + unpck;
with AVX we can elide the move (but a move is free anyway).  The
disadvantage of the vector load is the higher latency on the high part
(plus of course the STLF hit).  Since the vectorizer doesn't prune
individual stmts because of costs but only throws away the whole
opportunity if the overall cost doesn't seem profitable, it's difficult
to optimally handle this on the costing side I think.  Instead the
vectorizer should somehow be directed to use scalar loads + vector
construction when likely STLF failures are detected.
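
Schematically, on x86-64 (hand-written, not compiler output; the stack
offsets assumed from the +24B in the dump above):

        # scalar: two loads, each forwarded from one 8-byte store
        movsd   24(%rsp), %xmm0         # ray.dir.x
        movsd   32(%rsp), %xmm1         # ray.dir.y

        # SSE vector: one load + move + unpck
        movupd  24(%rsp), %xmm0         # loads dir.x/dir.y, spans two stores
        movapd  %xmm0, %xmm1
        unpckhpd %xmm1, %xmm1           # high lane for the scalar use

With AVX the last two instructions become a single non-destructive
vunpckhpd %xmm0, %xmm0, %xmm1.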

For example the following patch mitigates the issue for c-ray without
resorting to "late" adjustments via costs but instead by changing the
vectorization strategy for possibly affected loads, using target-independent
and likely flawed heuristics.  A full exercise of the cumulative-args
machinery might be able to tell how (parts of) a PARM_DECL are passed.
Whether the caller will end up using wider moves through %xmm remains a
guess of course.  What's also completely missing is an idea of how far from
the function entry this vectorization happens - for c-ray it would be enough
to restrict this to loads in BB 2, for example.
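
Such a query could look roughly like the following (an untested sketch
against the target hooks, written from memory - the INIT_CUMULATIVE_ARGS
arguments are target-specific and mark_parm_stlf_candidate is a
hypothetical helper):

  CUMULATIVE_ARGS args_so_far_v;
  INIT_CUMULATIVE_ARGS (args_so_far_v, TREE_TYPE (cfun->decl), NULL_RTX,
                        cfun->decl, -1);
  cumulative_args_t args_so_far = pack_cumulative_args (&args_so_far_v);
  for (tree parm = DECL_ARGUMENTS (cfun->decl); parm; parm = DECL_CHAIN (parm))
    {
      function_arg_info arg (TREE_TYPE (parm), /*named=*/true);
      if (targetm.calls.function_arg (args_so_far, arg) == NULL_RTX)
        /* Passed on the stack, so the caller stored it - loads from
           this PARM_DECL are the STLF candidates.  */
        mark_parm_stlf_candidate (parm);
      targetm.calls.function_arg_advance (args_so_far, arg);
    }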

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 5c9e8cfefa5..4f07e5ddc61 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2197,7 +2197,24 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
   /* Stores can't yet have gaps.  */
   gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);

-  if (slp_node)
+  if (!loop_vinfo
+      && vls_type == VLS_LOAD
+      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
+      && (TREE_CODE (TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0))
+         == PARM_DECL)
+      /* Assume that for a power of two number of elements the aggregate
+        move to the stack is using larger moves at the caller side.  */
+      && !pow2p_hwi (group_size))
+    {
+      /* When doing BB vectorizing force loads from function parameters
+        (???  that are passed in memory and stored in pieces likely
+        causing STLF failures) to be done elementwise.  */
+      /* ???  Note this will cause vectorization to fail because of
+        the fear of underestimating the cost of elementwise accesses,
+        see the end of get_load_store_type.  */
+      *memory_access_type = VMAT_ELEMENTWISE;
+    }
+  else if (slp_node)
     {
       /* For SLP vectorization we directly vectorize a subchain
         without permutation.  */
