https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118182
Bug ID: 118182
Summary: RISC-V: Miscompile for 410.bwaves, 416.gamess and 465.tonto from spec2006
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: kito at gcc dot gnu.org
Target Milestone: ---

Created attachment 59955
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59955&action=edit
Reduced testcase from 410.bwaves

410.bwaves, 416.gamess and 465.tonto from spec2006 miscompare when compiled with just `-O3 -march=rv64gcv`, and we found that this is caused by the reduction sum.

The root cause is that the reduction sum may execute with VL=0, and our generated code does not work as expected when VL=0 (it works fine when VL > 0).

The current code gen for the reduction sum:

```
# _148 = .MASK_LEN_FOLD_LEFT_PLUS (stmp__148.33_134, vect__70.32_138, { -1, ... }, loop_len_161, 0);
vsetvli zero,a5,e64,m1,ta,ma
vfmv.s.f v2,fa5
vfredosum.vs v1,v1,v2
vfmv.f.s fa5,v1
```

Here is a detailed analysis of why it does not work as expected when VL=0:

1. vfmv.s.f v2,fa5
   vfmv.s.f does nothing if VL=0, which means v2 will contain a garbage value.

2. vfredosum.vs v1,v1,v2
   vfredosum.vs does nothing if VL=0 and keeps vd unchanged, even with TA. (The spec says: If vl=0, no operation is performed and the destination register is not updated.)

3. vfmv.f.s fa5,v1
   vfmv.f.s moves the value from v1 even if VL=0, so this instruction itself is safe.

The net effect when VL=0 is that fa5 is overwritten with whatever element 0 of v1 already held, instead of keeping the incoming accumulator value.

---

The problem originates from the following loop:

```
! Compute the sum of squares of dq elements
do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         do l = 1, 5
            dqnorm = dqnorm + dq(l, i, j, k) * dq(l, i, j, k)
         end do
      end do
   end do
end do
```

It generates something like the following:

```
# ivtmp.48_106 = PHI <ivtmp.48_107(9), _84(8)>
_128 = MIN_EXPR <ivtmp.41_49, POLY_INT_CST [10, 10]>;
loop_len_165 = MIN_EXPR <_128, POLY_INT_CST [2, 2]>;
_122 = _128 - loop_len_165;
loop_len_164 = MIN_EXPR <_122, POLY_INT_CST [2, 2]>;
_121 = _122 - loop_len_164;
loop_len_163 = MIN_EXPR <_121, POLY_INT_CST [2, 2]>;
_120 = _121 - loop_len_163;
loop_len_162 = MIN_EXPR <_120, POLY_INT_CST [2, 2]>;
loop_len_161 = _120 - loop_len_162;
_125 = (void *) ivtmp.44_60;
...
stmp__148.33_137 = .MASK_LEN_FOLD_LEFT_PLUS (dqnorm_lsm.20_20, vect__70.32_150, { -1, ... }, loop_len_165, 0);
stmp__148.33_136 = .MASK_LEN_FOLD_LEFT_PLUS (stmp__148.33_137, vect__70.32_147, { -1, ... }, loop_len_164, 0);
stmp__148.33_135 = .MASK_LEN_FOLD_LEFT_PLUS (stmp__148.33_136, vect__70.32_142, { -1, ... }, loop_len_163, 0);
stmp__148.33_134 = .MASK_LEN_FOLD_LEFT_PLUS (stmp__148.33_135, vect__70.32_140, { -1, ... }, loop_len_162, 0);
_148 = .MASK_LEN_FOLD_LEFT_PLUS (stmp__148.33_134, vect__70.32_138, { -1, ... }, loop_len_161, 0);
...
```

NOTE: I have a candidate patch to fix this, but it needs a few more refinements before it reaches upstreamable quality.
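
For illustration only, here is a rough Python model of the analysis above. It is not GCC or simulator code. It assumes the POLY_INT_CST values take their minimum (10 and 2), uses an arbitrary `STALE` value to stand in for old register contents, and all function names are hypothetical. It shows that the MIN_EXPR chain can produce trailing lengths of 0, and that the three-instruction sequence then returns stale data instead of the accumulator.

```python
STALE = 99.0  # assumption: stand-in for whatever a register held before

def split_loop_len(remaining, total=10, per_vec=2):
    """Mirror the MIN_EXPR chain, taking [10, 10] as 10 and [2, 2] as 2."""
    left = min(remaining, total)      # _128
    lens = []
    for _ in range(4):                # loop_len_165 .. loop_len_162
        n = min(left, per_vec)
        lens.append(n)
        left -= n
    lens.append(left)                 # loop_len_161 = _120 - loop_len_162
    return lens

def fold_left_plus_ref(acc, v1, vl):
    """Intended semantics: in-order sum of the first vl elements onto acc."""
    for x in v1[:vl]:
        acc += x
    return acc

def fold_left_plus_asm(acc, v1, vl):
    """Model of vfmv.s.f / vfredosum.vs / vfmv.f.s for a given vl."""
    v2_0 = acc if vl > 0 else STALE   # vfmv.s.f v2,fa5: no write when vl == 0
    if vl > 0:                        # vfredosum.vs v1,v1,v2: skipped when vl == 0
        s = v2_0
        for x in v1[:vl]:
            s += x
        v1 = [s] + v1[1:]
    return v1[0]                      # vfmv.f.s fa5,v1: always reads element 0

def run(fold, data):
    """Chain five fold steps, as the five .MASK_LEN_FOLD_LEFT_PLUS calls do."""
    lens = split_loop_len(len(data))
    acc, pos = 0.0, 0
    for vl in lens:
        # Elements past vl are modelled as stale register contents.
        v1 = (data[pos:pos + vl] + [STALE, STALE])[:2]
        acc = fold(acc, v1, vl)
        pos += vl
    return lens, acc

data = [1.0, 2.0, 3.0, 4.0, 5.0]      # one pass of the l = 1..5 inner loop
print(run(fold_left_plus_ref, data))
print(run(fold_left_plus_asm, data))
```

With five remaining elements the lengths come out as 2, 2, 1, 0, 0. The reference fold yields 15.0, while the modelled instruction sequence returns `STALE` because the last two steps run with VL=0.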