https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92772

            Bug ID: 92772
           Summary: wrong code vectorizing masked max
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: critical
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ams at gcc dot gnu.org
  Target Milestone: ---

The testcase pr65947-10.c fails on amdgcn because there are more vector lanes
than there is data, and the algorithm created doesn't allow for this. (Actually
there's also a backend pattern missing, but I have a patch for that I'll commit
shortly.)

Here's the affected loop:

 float last = 0;

 for (int i = 0; i < 32; i++)
   if (a[i] < min_v)
     last = a[i];

This produces the following code (long lines shortened):

   vect_cst__33 = {min_v_11(D), .... min_v_11(D)};
   vect__4.16_32 = .MASK_LOAD (a_10(D), 4B, { -1, [...] -1, 0, [...] 0 });
   vect_last_6.17_34 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33,
                                      vect__4.16_32, { 0.0, [...] 0.0 }>;
   _38 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33, { 1, 2, [...] 64 },
                        { 0, [...] 0 }>;
   _40 = .REDUC_MAX (_38);
   _41 = {_40, _40, [...] _40};
   _43 = VEC_COND_EXPR <_38 == _41, vect_last_6.17_34, { 0.0, [...] 0.0 }>;
   _44 = VIEW_CONVERT_EXPR<vector(64) unsigned int>(_43);
   _45 = .REDUC_MAX (_44);
   _46 = VIEW_CONVERT_EXPR<float>(_45);
   return _46;

In English:

1. Do a masked load of 32 elements (into a 64-lane register). This loads "0.0"
into the spare lanes.
2. Compare all 64 lanes against "min_v". Label all the "true" lanes with
the lane number.
3. Use a reduction to find the greatest numbered "true" lane.
4. Zero all the loaded values apart from the one in the greatest lane.
5. Use a reduction to find the value of the lane that isn't zeroed.
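
The five steps above can be emulated lane by lane in plain C. This is only a
sketch of the sequence as read from the dump, not the vectorizer's own code:
the names last_scalar, last_vector_emulated, LANES and N_DATA are made up for
illustration, and the 64-lane width and 32-element count are taken from the
dump and the loop.

```c
#include <stdint.h>
#include <string.h>

#define LANES  64  /* vector width seen in the dump: vector(64) */
#define N_DATA 32  /* elements actually present in a[] */

/* Scalar reference: the original loop from the testcase. */
float last_scalar(const float *a, float min_v)
{
    float last = 0;
    for (int i = 0; i < N_DATA; i++)
        if (a[i] < min_v)
            last = a[i];
    return last;
}

/* Lane-by-lane emulation of the generated sequence (hypothetical helper). */
float last_vector_emulated(const float *a, float min_v)
{
    float loaded[LANES], selected[LANES];
    unsigned idx[LANES];
    unsigned max_idx = 0;
    uint32_t max_bits = 0;
    float result;

    /* Step 1: masked load; lanes past the data read as 0.0. */
    for (int l = 0; l < LANES; l++)
        loaded[l] = (l < N_DATA) ? a[l] : 0.0f;

    /* Step 2: compare every lane against min_v; "true" lanes keep their
       value and are labelled 1..64, "false" lanes get 0.0 and label 0. */
    for (int l = 0; l < LANES; l++) {
        int hit = loaded[l] < min_v;
        selected[l] = hit ? loaded[l] : 0.0f;
        idx[l] = hit ? (unsigned)(l + 1) : 0u;
    }

    /* Step 3: max-reduction over the labels. */
    for (int l = 0; l < LANES; l++)
        if (idx[l] > max_idx)
            max_idx = idx[l];

    /* Steps 4 and 5: zero every lane whose label differs from the maximum,
       then max-reduce the raw bit patterns as unsigned integers. */
    for (int l = 0; l < LANES; l++) {
        float v = (idx[l] == max_idx) ? selected[l] : 0.0f;
        uint32_t bits;
        memcpy(&bits, &v, sizeof bits);
        if (bits > max_bits)
            max_bits = bits;
    }

    memcpy(&result, &max_bits, sizeof result);
    return result;
}
```

With a[i] = i + 1 and min_v = 10.0, last_scalar returns 9.0 while
last_vector_emulated returns 0.0. With all-negative data and a negative
min_v the two agree, because the spare lanes then compare "false".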

That's slightly tortuous when we could just do a vec_extract on "_40", but
that's an aside.

The problem is in step 2: the spare lanes contain 0.0, which means that
comparing them against "min_v" returns "true" (whenever "min_v" is greater
than 0.0). This means that the algorithm always finds "last = a[63]", which
isn't a real value, and the result therefore always ends up being "0.0".
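
To make the step 2 failure concrete, here is a small C sketch of just the
lane-labelling and index reduction. The function greatest_true_lane and its
respect_load_mask switch are hypothetical, added only to contrast the two
behaviours; this is not a proposed patch and does not claim to show how the
vectorizer should be fixed.

```c
#define LANES  64  /* vector width seen in the dump */
#define N_DATA 32  /* elements actually loaded */

/* Returns the 1-based label of the greatest "true" lane, or 0 if no lane
   compares true.  respect_load_mask is a hypothetical switch: when nonzero,
   lanes that were inactive in the masked load are never counted as hits. */
unsigned greatest_true_lane(const float *a, float min_v, int respect_load_mask)
{
    unsigned max_idx = 0;

    for (int l = 0; l < LANES; l++) {
        int active = l < N_DATA;            /* lane was really loaded */
        float v = active ? a[l] : 0.0f;     /* spare lanes hold 0.0 */
        int hit = v < min_v;

        if (respect_load_mask)
            hit = hit && active;

        if (hit && (unsigned)(l + 1) > max_idx)
            max_idx = (unsigned)(l + 1);
    }
    return max_idx;
}
```

For a[i] = i + 1 and min_v = 10.0, the call with respect_load_mask = 0
returns 64 (a spare lane wins), while respect_load_mask = 1 returns 9, the
label of the last real element below min_v.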
