https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92772
Bug ID: 92772
Summary: wrong code vectorizing masked max
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: critical
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ams at gcc dot gnu.org
Target Milestone: ---

The testcase pr65947-10.c fails on amdgcn because there are more vector lanes than there is data, and the generated algorithm doesn't allow for this. (Actually there's also a backend pattern missing, but I have a patch for that which I'll commit shortly.)

Here's the affected loop:

    float last = 0;
    for (int i = 0; i < 32; i++)
      if (a[i] < min_v)
        last = a[i];

This produces the following code (long lines shortened):

    vect_cst__33 = {min_v_11(D), .... min_v_11(D)};
    vect__4.16_32 = .MASK_LOAD (a_10(D), 4B, { -1, [...] -1, 0, [...] 0 });
    vect_last_6.17_34 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33,
                                       vect__4.16_32, { 0.0, [...] 0.0 }>;
    _38 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33,
                         { 1, 2, [...] 64 }, { 0, [...] 0 }>;
    _40 = .REDUC_MAX (_38);
    _41 = {_40, _40, [...] _40};
    _43 = VEC_COND_EXPR <_38 == _41, vect_last_6.17_34, { 0.0, [...] 0.0 }>;
    _44 = VIEW_CONVERT_EXPR<vector(64) unsigned int>(_43);
    _45 = .REDUC_MAX (_44);
    _46 = VIEW_CONVERT_EXPR<float>(_45);
    return _46;

In English:

1. Do a masked load of 32 elements (into a 64-lane register). This loads "0.0" into the spare lanes.
2. Compare all 64 lanes against "min_v". Label all the "true" lanes with the lane number.
3. Use a reduction to find the greatest numbered "true" lane.
4. Zero all the loaded values apart from the one in the greatest lane.
5. Use a reduction to find the value of the lane that isn't zeroed.

That's slightly tortuous when we could just do a vec_extract on "_40", but that's an aside.

The problem is in step 2: the spare lanes contain 0.0, so comparing them against "min_v" returns "true" whenever min_v is positive. The algorithm then always finds "last = a[63]", which isn't a real value, and therefore always ends up returning "0.0".
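
For illustration, the five steps can be emulated lane by lane in plain C. This is only a sketch under stated assumptions, not the vectorizer's actual output: the names scalar_last, emulated_vector_last and LANES are made up here, the masked load is modelled as "inactive lanes read 0.0" as described in step 1, and n is assumed to be no larger than LANES.

```c
#include <stdint.h>
#include <string.h>

/* Assumed vector width, matching the vector(64) type in the dump. */
#define LANES 64

/* Scalar reference: the loop from the report. */
float scalar_last(const float *a, int n, float min_v)
{
    float last = 0;
    for (int i = 0; i < n; i++)
        if (a[i] < min_v)
            last = a[i];
    return last;
}

/* Hypothetical lane-by-lane emulation of the vectorized sequence.
   Requires n <= LANES. */
float emulated_vector_last(const float *a, int n, float min_v)
{
    float loaded[LANES];
    float selected[LANES];
    unsigned int label[LANES];

    /* Step 1: masked load; the spare lanes are filled with 0.0. */
    for (int l = 0; l < LANES; l++)
        loaded[l] = (l < n) ? a[l] : 0.0f;

    /* Step 2: compare ALL lanes (spare ones included) against min_v
       and label each "true" lane with its 1-based lane number. */
    for (int l = 0; l < LANES; l++) {
        int hit = loaded[l] < min_v;
        selected[l] = hit ? loaded[l] : 0.0f;
        label[l] = hit ? (unsigned int)(l + 1) : 0u;
    }

    /* Step 3: max reduction over the lane labels. */
    unsigned int max_label = 0;
    for (int l = 0; l < LANES; l++)
        if (label[l] > max_label)
            max_label = label[l];

    /* Steps 4 and 5: zero every lane except those carrying the greatest
       label, then do an unsigned max reduction over the raw float bits. */
    uint32_t max_bits = 0;
    for (int l = 0; l < LANES; l++) {
        float v = (label[l] == max_label) ? selected[l] : 0.0f;
        uint32_t bits;
        memcpy(&bits, &v, sizeof bits);
        if (bits > max_bits)
            max_bits = bits;
    }

    float result;
    memcpy(&result, &max_bits, sizeof result);
    return result;
}
```

With a[i] = i + 1 for the 32 elements and min_v = 10.5, scalar_last returns 10.0 but emulated_vector_last returns 0.0, because spare lane 64 wins the reduction in step 3. With negative data and a negative min_v the spare lanes compare "false" and the two functions agree.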